if (!require("gbm")) install.packages("gbm")
if (!require("pls")) install.packages("pls")
if (!require("caret")) install.packages("caret")
if (!require("ipred")) install.packages("ipred")
if (!require("party")) install.packages("party")
if (!require("rpart")) install.packages("rpart")
if (!require("RWeka")) install.packages("RWeka")
if (!require("Cubist")) install.packages("Cubist")
if (!require("mlbench")) install.packages("mlbench")
if (!require("partykit")) install.packages("partykit")
if (!require("elasticnet")) install.packages("elasticnet")
if (!require("randomForest")) install.packages("randomForest")
if (!require("AppliedPredictiveModeling")) install.packages("AppliedPredictiveModeling")
library(gbm)
library(pls)
library(caret)
library(ipred)
library(party)
library(rpart)
library(RWeka)
library(Cubist)
library(mlbench)
library(partykit)
library(elasticnet)
library(randomForest)
library(AppliedPredictiveModeling)
Note: Model descriptions are from Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Solutions are modifications of those posted by Max Kuhn on his public GitHub page. Function descriptions are from the RDocumentation website.
Single Decision Tree. Tree-based models generate a set of conditions that are highly interpretable and are easy to implement. They can effectively handle many types of predictors (sparse, skewed, continuous, categorical, etc.) without the need to pre-process them (imputation, feature selection, etc.). In addition, these models do not require the user to specify the form of the predictors' relationship to the response as, for example, a linear regression model does. There are many techniques for constructing regression trees. One of the oldest and most utilized is the Classification and Regression Tree (CART) methodology, also known as recursive partitioning. For regression, the model begins with the entire data set and searches every distinct value of every predictor to find the predictor and split value that partitions the data into two groups such that the overall sum of squares error (SSE) is minimized. Then, within each of these groups, the method searches for the predictor and split value that best reduces SSE until the number of samples in the splits falls below some threshold. The fully grown tree is then pruned back to a potentially smaller depth to mitigate over-fitting. If a predictor is never used in a split, the prediction equation is independent of those data.
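To make the exhaustive SSE search concrete, the following is a minimal sketch added here for illustration (not the rpart internals); the best_split() function and the toy data are names created for this example only. For one numeric predictor it evaluates every candidate split point and returns the one with the lowest total SSE.
# Illustrative sketch of the CART split criterion for a single numeric predictor:
# try every distinct value as a split point and keep the one minimizing total SSE.
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]   # a split must leave both sides non-empty
  sse <- sapply(candidates, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(split = candidates[which.min(sse)], SSE = min(sse))
}
set.seed(1)
x <- runif(200); y <- 10 * sin(pi * x) + rnorm(200)
best_split(x, y)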
Single Decision Tree. Conditional Inference Trees (CIT) provide a unified framework for unbiased tree-based models for regression, classification, and other scenarios. Statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split and a \(p\)-value can be computed for the test. Multiple comparison corrections reduce the number of false-positive test results that are incurred by conducting several statistical hypothesis tests. This reduces bias for highly granular data. This algorithm does not use pruning. As data sets are split, the decrease in the number of samples reduces the power of the hypothesis tests. This results in higher \(p\)-values and a lower likelihood of a new split (and over-fitting). Statistical hypothesis tests are not directly related to predictive performance, so it is still advisable to choose the complexity of the tree on the basis of performance (via resampling or some other means).
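As a brief, hedged illustration (the demo data frame and object names below are created for this sketch only), the mincriterion argument of ctree_control() is one minus the p-value threshold, so a stricter test yields a smaller tree without an explicit pruning step.
# Illustrative only: a stricter mincriterion (higher 1 - p threshold) yields fewer splits.
demo <- mlbench.friedman1(200, sd = 1)
demo <- data.frame(demo$x, y = demo$y)
cit_strict <- party::ctree(y ~ ., data = demo,
                           controls = party::ctree_control(mincriterion = 0.99))
cit_loose  <- party::ctree(y ~ ., data = demo,
                           controls = party::ctree_control(mincriterion = 0.80))
c(strict = length(unique(party::where(cit_strict))),   # number of terminal nodes
  loose  = length(unique(party::where(cit_loose))))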
Single Decision Tree. In Simple Regression Trees each terminal node uses the average of the training set outcomes in that node for prediction and therefore may not do a good job predicting samples whose true outcomes are extremely high or low. The Regression Model Tree approach implemented with the M5 algorithm uses a different splitting criterion, predicts the outcome using a linear model, and results in a model that is often a combination of the predictions from different models along the same path through the tree. Like Simple Regression Trees, the initial split is found using an exhaustive search over the predictors and training set samples, but, unlike those models, the expected reduction in the node's error rate is used. Model Trees also incorporate a type of smoothing to decrease the potential for over-fitting. This smoothing can have a significant positive effect on the Model Tree when the linear models across nodes are very different.
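A minimal sketch of an M5 model tree fit, added here for illustration (it assumes the RWeka Java backend is available and reuses the same kind of simulated data):
# Illustrative only: M5P() grows a model tree with a linear model in each terminal node.
demo <- mlbench.friedman1(200, sd = 1)
demo <- data.frame(demo$x, y = demo$y)
m5_fit <- M5P(y ~ ., data = demo)
m5_fit   # printing shows the splits and the per-node linear models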
Ensemble Method. Models based on single trees have particular weaknesses: (1) model instability (i.e., slight changes in the data can drastically change the structure of the tree or rules and, hence, the interpretation) and (2) less-than-optimal predictive performance. To combat these problems, researchers developed ensemble methods that combine many trees into one model. In the 1990s, ensemble techniques (methods that combine many models’ predictions) began to appear. Bagging (BAG), short for bootstrap aggregation, was one of the earliest developed ensemble techniques. Bagging is a general approach that uses bootstrapping in conjunction with any regression (or classification) model to construct an ensemble. Each model in the ensemble is used to generate a prediction for a new sample and these \(m\) predictions are averaged to give the bagged model’s prediction. For models that produce an unstable prediction, like regression trees, aggregating over many versions of the training data reduces the variance in the prediction and, hence, makes the prediction more stable. Bagging stable, lower variance models like linear regression and MARS offers less improvement in predictive performance. Although Bagging usually improves predictive performance for unstable models, results are less interpretable and computational costs are higher. CART (or conditional inference trees) can be used as the base learner in Bagging.
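The core of bagging can be sketched in a few lines (an illustration added here, not part of the original code; names such as demo and m are arbitrary): grow one tree per bootstrap sample and average the m predictions for a new sample.
# Illustrative only: hand-rolled bagging of rpart trees.
demo <- mlbench.friedman1(200, sd = 1)
demo <- data.frame(demo$x, y = demo$y)
m <- 25
trees <- lapply(seq_len(m), function(i) {
  boot <- demo[sample(nrow(demo), replace = TRUE), ]   # bootstrap sample
  rpart(y ~ ., data = boot)
})
new_sample <- demo[1, ]
mean(sapply(trees, predict, newdata = new_sample))     # bagged prediction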
Ensemble Method. Generating bootstrap samples introduces a random component into the tree building process, which induces a distribution of trees, and therefore also a distribution of predicted values for each sample. The trees in Bagging, however, are not completely independent of each other since all the original predictors are considered at every split of every tree. This characteristic is known as tree correlation and prevents Bagging from optimally reducing variance of the predicted values. Random Forests (RF) is a unified model that de-correlates trees by adding randomness to the tree construction (learning) process: implementing random split selection, building entire trees based on random subsets of descriptors, and adding noise to the response in order to perturb tree structure. This ensemble of many independent, strong learners yields an improvement in error rates that is robust to a noisy response. At the same time, the independence of learners can underfit data when the response is not noisy. Compared to Bagging, Random Forests is more computationally efficient on a tree-by-tree basis. The ensemble nature of Random Forests makes it impossible to gain an understanding of the relationship between the predictors and the response. CART (or conditional inference trees) can be used as the base learner in Random Forests.
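In the randomForest() implementation, the de-correlation is governed mainly by mtry, the number of randomly chosen predictors considered at each split. The comparison below is an added illustration (the demo data and object names are arbitrary), using the out-of-bag error stored by the fitted model.
# Illustrative only: smaller mtry gives more de-correlated trees.
demo <- mlbench.friedman1(200, sd = 1)
demo <- data.frame(demo$x, y = demo$y)
rf_small <- randomForest(y ~ ., data = demo, mtry = 3, ntree = 500)
rf_large <- randomForest(y ~ ., data = demo, mtry = 10, ntree = 500)
c(mtry_3 = tail(rf_small$mse, 1), mtry_10 = tail(rf_large$mse, 1))  # final OOB MSE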
Ensemble Method. Boosting models were originally developed for classification problems and were later extended to the regression setting. Boosting algorithms (gradient boosting machines) are influenced by learning theory, in which weak classifiers that predict marginally better than random are combined (or boosted) to produce an ensemble classifier with a superior generalized misclassification error rate. The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function. The algorithm is typically initialized with the best guess of the response (e.g., the mean of the response in regression), the gradient (e.g., residual) is then calculated, and a model is then fit to the residuals to minimize the loss function. The current model is added to the previous model, and the procedure continues for a user-specified number of iterations. To mitigate greediness (choosing the optimal weak learner at each stage), a constraint is built into the learning process by employing regularization, or shrinkage, in which only a fraction of the current predicted value (commonly referred to as the learning rate) is added to the previous iteration's predicted value. Stochastic Gradient Boosting (SGB) implements all of the aforementioned using random sampling during selection of the training data. This final modification improves the prediction accuracy of Boosting while also reducing the demand for computational resources.
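The boosting loop itself can be sketched directly (an added illustration, not the gbm internals; lr, iterations, and demo are arbitrary names): start from the mean, repeatedly fit a shallow tree to the current residuals, and add only a fraction (the learning rate) of its prediction.
# Illustrative only: gradient boosting with stumps and a learning rate.
demo <- mlbench.friedman1(200, sd = 1)
demo <- data.frame(demo$x, y = demo$y)
lr <- 0.1; iterations <- 100
pred <- rep(mean(demo$y), nrow(demo))               # initial guess
for (i in seq_len(iterations)) {
  resid_data <- data.frame(demo[, setdiff(names(demo), "y")], r = demo$y - pred)
  stump <- rpart(r ~ ., data = resid_data,
                 control = rpart.control(maxdepth = 1, cp = 0))   # a single-split tree
  pred <- pred + lr * predict(stump, demo)          # shrunken update
}
sqrt(mean((demo$y - pred)^2))                       # training RMSE of the ensemble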
Single Decision Tree or Ensemble Method. A rule is defined as a distinct path through a tree. The number of samples affected by a rule is called its coverage. A rule-based model starts by creating an initial unsmoothed model tree and then saving only the rule with the largest coverage. The samples covered by the rule are removed from the training set and another model tree is created with the remaining data. Again, only the rule with the maximum coverage is retained. This process repeats until all the training set data have been covered by at least one rule. A new sample is then predicted by determining which rule(s) it falls under, and then the linear model associated with the rule of largest coverage is applied. Pruning and smoothing must be applied judiciously, as the impact of pruning on rule-based models is large and smoothing has a larger impact on unpruned models. Cubist (CUBE) is a rule-based model that is an amalgamation of several methodologies. Cubist was previously only available in a commercial capacity, but in 2011 the source code was released under an open-source license. Cubist differs from previously described tree-based models and their rule-based variants in the techniques it uses for linear model smoothing, rule creation, and pruning. Cubist also employs an optional Boosting-like ensemble procedure called committees, and predictions generated by the model rules can be adjusted using nearby points (neighbors) from the training set data. These nearest neighbors are determined using Manhattan (a.k.a. city block) distances, and neighbors are only included if they are not over the average distance from the prediction sample.
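A minimal sketch of those two Cubist options, added here for illustration (demo_x, demo_y, and the chosen values are arbitrary): committees are requested when fitting, while the neighbor adjustment is applied at prediction time.
# Illustrative only: Cubist committees at fit time, neighbor correction at predict time.
demo <- mlbench.friedman1(200, sd = 1)
demo_x <- data.frame(demo$x); demo_y <- demo$y
cube_fit <- cubist(x = demo_x, y = demo_y, committees = 10)
head(predict(cube_fit, demo_x))                  # rule/committee predictions
head(predict(cube_fit, demo_x, neighbors = 5))   # adjusted using 5 nearest neighbors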
Recreate the simulated data from Exercise 7.2:
set.seed(624)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
set.seed(624)
model1 <- randomForest(y ~ ., data=simulated, importance=TRUE, ntree=1000)
varimp1 <- varImp(model1)
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
Methods for fitting the random forest model and the variable importance score calculations are given in the question setup. The function randomForest() is used for fitting the model. The parameter values importance=TRUE and ntree=1000 dictate that the importance of predictors should be assessed and that the model should grow 1000 trees. The varImp() function calculates the variable importance for the model.
varimp1
## Overall
## V1 52.1858851
## V2 40.4759963
## V3 12.1019585
## V4 61.0955614
## V5 23.6547291
## V6 1.1061162
## V7 -2.3741595
## V8 0.1635707
## V9 -1.0509349
## V10 1.9515464
Per Exercise 7.2, Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\), where the \(x\) values are random variables uniformly distributed on \([0,1]\) (there are also 5 other non-informative variables created in the simulation). Applying the Random Forest model to these data and then calculating the variable importance reveals that the Random Forest model favors the informative predictors (V1 - V5) and does not significantly use the uninformative predictors (V6 - V10).
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(624)
duplicated <- simulated
duplicated$V11 <- duplicated$V1 + rnorm(200) * 0.1
cor(duplicated$V11, duplicated$V1)
## [1] 0.9466669
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Methods for adding an additional predictor that is highly correlated with one of the informative predictors are given in the question setup. This is done using the rnorm() function, which produces pseudo-random Gaussian numbers. Specifically, the question setup adds a pseudo-random Gaussian number multiplied by 0.1 to each element in the V1 vector to create the V11 vector. The correlation between V1 and V11 is then confirmed with the base R cor() function. This new dataset from the question setup is fitted with a Random Forest model using the same randomForest() function. Again, the parameter values importance=TRUE and ntree=1000 dictate that the importance of predictors should be assessed and that the model should grow 1000 trees. The varImp() function then calculates the variable importance for the new model. The variable importance calculations for both models are then output in a data frame for comparison. To create the data frame, the first model needs to be padded with a V11 predictor that has a variable importance value of NA.
set.seed(624)
model2 <- randomForest(y ~ ., data=duplicated, importance=TRUE, ntree=1000)
varimp1 <- varImp(model1)
varimp2 <- varImp(model2)
colnames(varimp1) <- "Simulated"
colnames(varimp2) <- "Correlated"
padding <- data.frame(Simulated=NA, row.names="V11")
data.frame(rbind(varimp1, padding), varimp2)
## Simulated Correlated
## V1 52.1858851 32.0082853
## V2 40.4759963 39.7176368
## V3 12.1019585 11.2705227
## V4 61.0955614 58.6501578
## V5 23.6547291 24.0180835
## V6 1.1061162 1.1338830
## V7 -2.3741595 -1.7153863
## V8 0.1635707 0.4555065
## V9 -1.0509349 -1.0899257
## V10 1.9515464 1.2979827
## V11 NA 26.4040559
Adding the predictor V11 that is highly correlated with the most informative predictor V1 does change the variable importance score of V1, as well as the importance scores of all the other informative and uninformative variables, under the Random Forest (RF) model. The importance of this highly correlated V11 predictor even outranks two of the five informative variables. This exercise highlights a weakness in tree-based models. When variables are highly correlated, tree-based models have a tendency to select more predictors than are actually needed, with the choice between highly correlated predictors being nearly random.
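One mitigation, sketched below as an added illustration (the 0.9 cutoff and the reduced object name are assumptions, not part of the exercise), is to flag and drop highly correlated predictors with caret's findCorrelation() before fitting.
# Illustrative only: drop one member of each highly correlated predictor pair.
pred_cols <- setdiff(names(duplicated), "y")
flagged <- findCorrelation(cor(duplicated[, pred_cols]), cutoff = 0.9)
pred_cols[flagged]                                   # one of the V1/V11 pair is flagged
reduced <- duplicated[, c(setdiff(pred_cols, pred_cols[flagged]), "y")]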
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
The cforest() implementation of the random forest (and bagging) algorithm differs from randomForest() with respect to the base learners used and the aggregation scheme applied. Conditional Inference Trees are fitted to each of the ntree bootstrap samples (500 by default, or defined via cforest_control) of the learning sample. The cforest_control() hyperparameter mtry, which dictates the number of randomly preselected variables, is set here to the number of columns in the data minus one (the response), i.e., to the full set of predictors. After the cforest() model is fit, there are two functions that can calculate the importance of the variables. The varimp() function, spelled all in lowercase, from the party package calculates the standard and conditional variable importance for cforest() models following the permutation principle of the "mean decrease in accuracy" importance found in the randomForest() function. The varImp() function with an uppercase "I" from the caret package is a generic method for calculating variable importance for objects produced by train() and for specific model fits such as cforest. For the sake of consistency, the varImp() function with an uppercase "I" is used. Again, the variable importance calculations for all the models are then output in a data frame for comparison. To create the data frame, models without the V11 predictor are padded with a V11 predictor that has a variable importance value of NA.
set.seed(624)
ctrl1 <- cforest_control(mtry = ncol(simulated) - 1)
fit1 <- cforest(y ~ ., data = simulated, controls = ctrl1)
ctrl2 <- cforest_control(mtry = ncol(duplicated) - 1)
fit2 <- cforest(y ~ ., data = duplicated, controls = ctrl2)
varimp1 <- varImp(fit1)
varimp2 <- varImp(fit2)
varimp3 <- varImp(fit1, conditional=T)
varimp4 <- varImp(fit2, conditional=T)
colnames(varimp1) <- "Simulated_CIT_SI"
colnames(varimp2) <- "Correlated_CIT_SI"
colnames(varimp3) <- "Simulated_CIT_CI"
colnames(varimp4) <- "Correlated_CIT_CI"
padding1 <- data.frame(Simulated_CIT_SI=NA, row.names="V11")
padding2 <- data.frame(Simulated_CIT_CI=NA, row.names="V11")
data.frame(rbind(varimp1, padding1), rbind(varimp3, padding2), varimp2, varimp4)
## Simulated_CIT_SI Simulated_CIT_CI Correlated_CIT_SI Correlated_CIT_CI
## V1 8.463440565 4.618481133 7.892100991 3.166580468
## V2 4.771477540 2.468352173 4.776313322 2.291296308
## V3 0.205261839 0.046998983 0.186140203 0.041731395
## V4 13.017103358 7.018594423 12.956593825 7.536054721
## V5 2.128344753 1.184514596 2.085292315 0.969294330
## V6 0.092270913 0.019074037 0.037602144 0.008036876
## V7 0.013105955 0.003591459 0.002315421 -0.003410836
## V8 -0.046276478 -0.018015592 -0.042813099 -0.017941796
## V9 -0.015857479 -0.007643672 0.011124091 0.011078896
## V10 -0.004241806 -0.016000361 0.009453343 0.009261656
## V11 NA NA 0.445232778 0.076249725
Conditional variable importance scores are supposed to reflect "the true impact of each predictor variable more reliably than" the standard variable importance score calculation method. That, however, does not seem to be the case here. For the Conditional Inference Trees (CIT), the conditional importance (CI) scores of the informative variables follow the same ranking pattern as the standard importance (SI) scores for both the simulated and the highly correlated data. The pattern is nearly the same for the uninformative predictors as well. Moreover, the decrease in importance of the informative variables when a highly correlated variable is included is greater when using the conditional importance (CI) scores.
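For reference, the lowercase varimp() function from the party package mentioned in the question can also be called on the cforest objects directly (an added illustration; the conditional run can be slow):
# Illustrative only: party's own permutation importance on the fitted cforest models.
party::varimp(fit2)                       # standard (unconditional) importance
party::varimp(fit2, conditional = TRUE)   # Strobl et al. (2007) conditional importance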
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Unlike the ipredbagg() function from the ipred package, the bagging() function from the same package conducts bagging for classification, regression, and survival trees using a formula interface (Y ~ X) instead of a non-formula interface (y=Y, x=X). The trees in the bagging() function are computed using the implementation in the rpart package. The nbagg parameter specifies the number of bootstrap replications. The gbm() function fits generalized boosted regression models. The distribution parameter, which is dependent on the response variable, is set to Gaussian since the response is not of Bernoulli (2 classes), Multinomial (more than 2 classes), or Survival (right censored) types. The cubist() function from the Cubist package fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set. The committees parameter specifies how many boosting iterations should be used in the ensemble. The Random Forest implementations randomForest() and cforest() are more flexible and reliable for computing Bootstrap Aggregated trees than the bagging() function, and although they should generally be used instead of the bagging() function, the bagging() function is being used in this exercise for demonstration purposes. Again, the variable importance calculations for all the models are then output in a data frame for comparison. To create the data frame, models without the V11 predictor are padded with a V11 predictor that has a variable importance value of NA.
set.seed(624)
fit3 <- bagging(y ~ ., simulated, nbagg = 50)
fit4 <- bagging(y ~ ., duplicated, nbagg = 50)
set.seed(624)
fit5 <- gbm(y ~ ., simulated, n.trees = 100, distribution = "gaussian")
fit6 <- gbm(y ~ ., duplicated, n.trees = 100, distribution = "gaussian")
set.seed(624)
fit7 <- cubist(x=simulated[,-11], y=simulated[,11], committees = 100)
fit8 <- cubist(x=duplicated[,-11], y=duplicated[,11], committees = 100)
varimp1 <- varImp(fit3)
varimp2 <- varImp(fit4)
varimp3 <- varImp(fit5, numTrees = 100)
varimp4 <- varImp(fit6, numTrees = 100)
varimp5 <- varImp(fit7)
varimp6 <- varImp(fit8)
colnames(varimp1) <- "Sim_BAG"; colnames(varimp2) <- "Cor_BAG"
colnames(varimp3) <- "Sim_SGB"; colnames(varimp4) <- "Cor_SGB"
colnames(varimp5) <- "Sim_CUBE"; colnames(varimp6) <- "Cor_CUBE"
padding1 <- data.frame(Sim_BAG=NA, row.names="V11")
padding2 <- data.frame(Sim_SGB=NA, row.names="V11")
padding3 <- data.frame(Sim_CUBE=NA, row.names="V11")
data.frame(rbind(varimp1, padding1),
rbind(varimp3, padding2),
rbind(varimp5, padding3),
varimp2, varimp4, varimp6)
## Sim_BAG Sim_SGB Sim_CUBE Cor_BAG Cor_SGB Cor_CUBE
## V1 2.1613930 7873.1807 64.0 2.1429383 9506.4479 61.0
## V10 0.8968806 1629.1682 65.5 0.7543033 1599.3984 63.0
## V2 2.3825483 0.0000 46.5 1.8380012 0.0000 48.0
## V3 1.5342296 58742.6302 35.5 2.1571350 55112.2638 38.0
## V4 2.0067900 506.7549 25.5 1.3959607 472.4445 8.5
## V5 2.4033751 0.0000 2.5 2.0409922 0.0000 25.0
## V6 0.9973661 0.0000 0.5 2.3613080 0.0000 1.0
## V7 0.7475497 0.0000 4.0 0.6820392 0.0000 4.0
## V8 0.7030260 0.0000 3.0 0.6317266 0.0000 2.0
## V9 0.7401711 0.0000 2.5 0.5615145 0.0000 1.5
## V11 NA NA NA 0.7099651 1098.1727 0.5
Adding the predictor V11 that is highly correlated with the most informative predictor V1 has differing effects on the variable importance score of V1 as well as the importance scores of all the other informative and uninformative variables. The Bootstrap Aggregation (BAG) method performs worse than the Conditional Inference Trees (CIT) from the previous exercise. Adding the highly correlated variable makes the BAG model interpret the correlated variable V11 as unimportant, but one of the uninformative variables (V6) as the most important. The Stochastic Gradient Boosting (SGB) method performs poorly as well, with one uninformative variable (V10) being listed as important and two informative variables (V2, V5) being listed as unimportant. The Rule-Based Cubist (CUBE) model fares better than the other models by not marking any informative variables as unimportant, but it also lists an uninformative variable (V10) as the most important variable. This exercise highlights a weakness in tree-based models. When variables are highly correlated, tree-based models have a tendency to select more predictors than are actually needed, with the choice between highly correlated predictors being nearly random.
Use a simulation to show tree bias with different granularities.
Single Regression Trees suffer from selection bias: predictors with a higher number of distinct values (finer granularity) are favored over predictors with fewer distinct values (coarser granularity). The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. There is therefore a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all. Also, as the number of missing values increases, the selection of predictors becomes more biased. This phenomenon can be exemplified by creating a coarse, two-level predictor that is related to the response variable and a second continuous predictor, with many distinct values, that is unrelated to the response variable. Specifically, a predictor X1 with 100 one values and 100 two values is created. Then a response variable Y is created that is equal to X1 plus a pseudo-random Gaussian variable with \(\mu=0, \sigma=4\). A completely unrelated continuous pseudo-random Gaussian variable X2 with \(\mu=0, \sigma=2\) is then created. These variables are then placed in a data frame and modeled using the rpart() function, which makes splits based on the CART methodology. The varImp() function then calculates the variable importance for the model.
set.seed(624)
X1 <- rep(1:2, each=100)
Y <- X1 + rnorm(200, mean=0, sd=4)
set.seed(624)
X2 <- rnorm(200, mean=0, sd=2)
simData <- data.frame(Y=Y, X1=X1, X2=X2)
set.seed(624)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
## Overall
## X1 0.8849928
## X2 4.7209459
The two-level predictor X1, which is related to Y, is found to be less important than the continuous X2 predictor, which is completely unrelated to Y. The selection bias pitfall of Single Regression Trees is highlighted well in this example. The importance of the unrelated X2 predictor is not just greater, it is substantially greater. The calculated importance value of the unrelated X2 predictor is over five times larger than that of the related X1 predictor.
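As an added, hedged check (not part of the original exercise), a conditional inference tree, which selects splits with hypothesis tests, would be expected to be less prone to this bias on the same data:
# Illustrative only: a conditional inference tree fit to the same simulated data.
set.seed(624)
cit_bias <- party::ctree(Y ~ ., data = simData)
cit_bias   # inspect which predictor (if any) is chosen for the splits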
In Stochastic Gradient Boosting the Bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. The code below generates variable importance plots for boosting using two extreme values for the Bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data.
data(solubility)
X <- solTrainXtrans; Y <- solTrainY
indx <- createFolds(Y, returnTrain = T)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(624)
tg1 <- expand.grid(interaction.depth=seq(1,7,by=2),
n.trees=seq(100,1000,by=100), shrinkage = c(0.01, 0.1), n.minobsinnode=10)
tune1 <- train(X, Y, method="gbm", tuneGrid=tg1, trControl=ctrl, verbose=F)
int_dpth <- tune1$bestTune$interaction.depth
ntrees <- tune1$bestTune$n.trees
set.seed(624)
tg2 <- expand.grid(interaction.depth=int_dpth,
n.trees=ntrees, shrinkage=0.1, n.minobsinnode=10)
tune2 <- train(X, Y, method = "gbm", tuneGrid=tg2,
trControl=ctrl, bag.fraction=0.1, verbose=F)
set.seed(624)
tg3 <- expand.grid(interaction.depth=int_dpth,
n.trees=ntrees, shrinkage=0.9, n.minobsinnode=10)
tune3 <- train(X, Y, method = "gbm", tuneGrid=tg3,
trControl=ctrl, bag.fraction=0.9, verbose=F)
A comparison of variable importance magnitudes for differing values of the Bagging fraction and shrinkage parameters. The left-hand plot has both tuning parameters set to 0.1, and the right-hand plot has both tuning parameters set to 0.9:
varimp1 <- varImp(tune2, scale=F); varimp2 <- varImp(tune3, scale=F)
plot1 <- plot(varimp1, top=25, scales=list(y=list(cex=.95)))
plot2 <- plot(varimp2, top=25, scales=list(y=list(cex=.95)))
print(plot1, split=c(1,1,2,1), more=T); print(plot2, split=c(2,1,2,1))
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. The hyperparameters declared in the tuning grid to be evaluated for the gbm (SGB) model are given in the question setup. Models with various variable interaction depths (interaction.depth), numbers of fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode) are compared. Then the best interaction depth and number of iterations (fitted trees) from the tuning are examined more closely with their own SGB models. The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion.
tune1$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 75 500 7 0.1 10
varimp1 <- varImp(tune2)$importance
varimp2 <- varImp(tune3)$importance
df <- data.frame(sort(varimp1$Overall, T), sort(varimp2$Overall, T))
colnames(df) <- c("ShrinkBag_0.1" , "ShrinkBag_0.9")
head(df, 15)
## ShrinkBag_0.1 ShrinkBag_0.9
## 1 100.000000 100.000000
## 2 75.971280 46.330520
## 3 39.558941 45.716197
## 4 37.390117 43.425955
## 5 22.844465 36.992822
## 6 11.207339 23.950815
## 7 9.884140 22.157978
## 8 5.313661 17.196800
## 9 5.233103 15.727436
## 10 4.616990 13.632785
## 11 4.079079 13.384229
## 12 3.827388 12.159232
## 13 3.771225 10.210433
## 14 3.059019 9.302018
## 15 2.877306 8.447839
According to the authors, model greediness (choosing the optimal weak learner at each stage) increases with the learning rate: the larger the fraction of each tree's prediction that is added, the more the ensemble keeps re-using the same dominant predictors. The model on the right, with a 0.9 learning rate, is therefore greedier and concentrates its importance on the first few predictors, whereas the left model's 0.1 learning rate spreads importance across more predictors. The bagging fraction works in the same direction: the right model uses 0.9 of the training set to propose each new tree, so there is little randomness from sample to sample and the same predictors keep being selected, while the left model's 0.1 bagging fraction injects far more randomness, giving a wider set of predictors a chance to enter the trees and receive importance.
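The concentration can also be inspected directly from the underlying gbm objects (an added illustration; it assumes the tune2 and tune3 fits from above): summary() returns the relative influence of each predictor, ordered from largest to smallest.
# Illustrative only: relative influence from the two final gbm fits.
inf_low  <- summary(tune2$finalModel, plotit = FALSE)   # shrinkage and bag.fraction = 0.1
inf_high <- summary(tune3$finalModel, plotit = FALSE)   # shrinkage and bag.fraction = 0.9
head(inf_low, 10); head(inf_high, 10)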
Which model do you think would be more predictive of other samples?
The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion. The bagging fraction introduces randomness to the model that improves its predictive ability.
data.frame(model="SGB_0.1", tune2$bestTune, RMSE=min(tune2$results$RMSE), row.names="")
## model n.trees interaction.depth shrinkage n.minobsinnode RMSE
## SGB_0.1 500 7 0.1 10 0.6524385
data.frame(model="SGB_0.9", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
## model n.trees interaction.depth shrinkage n.minobsinnode RMSE
## SGB_0.9 500 7 0.9 10 0.8252062
Greedy models are less likely to select the optimal global model and are prone to over-fitting. Stochastic models reduce prediction variance. Therefore, the less greedy model on the left (with a 0.1 learning rate) that is also more random (due to only selecting 0.1 of the training set observations to propose the next tree in the expansion) would be more predictive of other samples. The RMSE metric for the respective models supports this conclusion.
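A hedged confirmation on the held-out solubility test set (an added check; solTestXtrans and solTestY ship with the solubility data loaded above): the less greedy, more random model should generalize better.
# Illustrative only: test set RMSE for the two SGB fits.
fcast_low  <- predict(tune2, newdata = solTestXtrans)
fcast_high <- predict(tune3, newdata = solTestXtrans)
c(SGB_0.1 = RMSE(fcast_low, solTestY), SGB_0.9 = RMSE(fcast_high, solTestY))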
How would increasing interaction depth affect the slope of predictor importance for either model?
The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion. The interaction depth (maximum tree depth) controls how many splits, and therefore how many distinct predictors, each tree can use.
plot(tune1)
The tuning plot shows that increasing the interaction depth improves RMSE for these data, with the size of the improvement depending on the learning rate and the number of trees. More importantly for this question, deeper trees can split on more predictors, so increasing the interaction depth spreads the variable importance across a larger set of predictors in either model. The ordered variable importance plot would therefore have a flatter slope: the top predictors would account for a smaller share of the total importance, and more predictors would receive non-trivial importance values.
Use a single predictor in the solubility data, such as the molecular weight or the number of carbon atoms and fit several models:
data(solubility)
SOL_train_X <- subset(solTrainXtrans, select="MolWeight")
SOL_train_Y <- solTrainY
SOL_test_X <- subset(solTestXtrans, select="MolWeight")
SOL_test_Y <- solTestY
A simple regression tree.
The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset where the MolWeight field is selected. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. CART can be implemented in the train() function using the rpart and the rpart2 methods. The rpart method tunes the model over the complexity parameter (right-sized tree based on RMSE). The rpart2 method tunes the model over the maximum depth (of the tree).
set.seed(624)
tune1 <- train(SOL_train_X, SOL_train_Y, method = "rpart2", tuneLength = 1)
min(tune1$results$RMSE)
## [1] 1.663586
Fitting a CART model between the Molecular Weight predictor and the solubility dataset target variable results in a 1.663586 RMSE.
A random forest model.
The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset selecting the MolWeight field. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The rf method in the train() function implements Random Forest classification and regression using the randomForest package.
set.seed(624)
tune2 <- train(SOL_train_X, SOL_train_Y, method = "rf", tuneLength = 1)
min(tune2$results$RMSE)
## [1] 1.562868
Fitting a RF model between the Molecular Weight predictor and the solubility dataset target variable results in a 1.562868 RMSE.
Different Cubist models with a single rule or multiple committees (each with and without using neighbor adjustments).
The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset selecting the MolWeight field. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. The cubist method in the train() function implements a rule-based M5 model with additional corrections based on nearest neighbors in the training set using the Cubist package. The hyperparameters declared in the tuning grid to be evaluated are the committees parameter, which specifies how many boosting iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of instances to use to correct the rule-based prediction.
set.seed(624)
tg3 <- expand.grid(committees = 1, neighbors = 0)
tune3 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg3)
min(tune3$results$RMSE)
## [1] 1.551064
set.seed(624)
tg4 <- expand.grid(committees = 1, neighbors = c(1,3,5,7))
tune4 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg4)
min(tune4$results$RMSE)
## [1] 1.619041
set.seed(624)
tg5 <- expand.grid(committees = 100, neighbors = 0)
tune5 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg5)
min(tune5$results$RMSE)
## [1] 1.5287
set.seed(624)
tg6 <- expand.grid(committees = 100, neighbors = c(1,3,5,7))
tune6 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg6)
min(tune6$results$RMSE)
## [1] 1.602018
Fitting CUBE models between the Molecular Weight predictor and the solubility dataset target variable results in RMSE values ranging between [1.528700, 1.619041].
Plot the predictor data versus the solubility results for the test set. Overlay the model predictions for the test set. How do the models differ? Does changing the tuning parameter(s) significantly affect the model fit?
The predict() function applies a fitted model to data consisting of predictor variables. A loop is used to plot the predictor test data versus the solubility test results and then overlay the predictions of the six models. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics.
models <- c("CART","RF","CUBE_1.0","CUBE_1.n","CUBE_C.0","CUBE_C.n")
fcast1 <- predict(tune1, newdata = SOL_test_X)
fcast2 <- predict(tune2, newdata = SOL_test_X)
fcast3 <- predict(tune3, newdata = SOL_test_X)
fcast4 <- predict(tune4, newdata = SOL_test_X)
fcast5 <- predict(tune5, newdata = SOL_test_X)
fcast6 <- predict(tune6, newdata = SOL_test_X)
fcast <- cbind(fcast1, fcast2, fcast3, fcast4, fcast5, fcast6)
par(mfrow=c(3, 2), mar = c(0, 1, 2, 0), oma = c(0, 0, 0.5, 0.5))
for (i in 1:6) {
X <- SOL_test_X$MolWeight; Y <- SOL_test_Y
plot(X, Y, xaxt="n", yaxt="n", xlab="", ylab="", main=models[i])
par(new=T);
plot(X, fcast[,i], xaxt="n", yaxt="n", xlab="", ylab="", col="blue")
}
data.frame(row.names = models, rbind(
postResample(pred = fcast1, obs = SOL_test_Y),
postResample(pred = fcast2, obs = SOL_test_Y),
postResample(pred = fcast3, obs = SOL_test_Y),
postResample(pred = fcast4, obs = SOL_test_Y),
postResample(pred = fcast5, obs = SOL_test_Y),
postResample(pred = fcast6, obs = SOL_test_Y)))
## RMSE Rsquared MAE
## CART 1.712811 0.3219683 1.3079220
## RF 1.333482 0.5921083 0.9325201
## CUBE_1.0 1.543414 0.4862360 1.0864710
## CUBE_1.n 1.585349 0.4235102 1.1716094
## CUBE_C.0 1.540014 0.4881868 1.0856421
## CUBE_C.n 1.586609 0.4223147 1.1720924
Overlaying the model predictions on the test set is very useful in highlighting the differences between the models. The CART model predicts values within two categories, which does not fit the data very well. The RF model fits the data much better, but its predictions are the most heteroskedastic of all the fitted models. The CUBE models appear to perform the best, but there are differences between them. Although the number of committees (which specifies how many boosting iterations should be used in the ensemble) has little to no effect on the CUBE model predictions, the number of neighbors (which specifies the number of instances to use to correct the rule-based prediction) has a significant impact. Declaring zero neighbors makes the CUBE model predictions homoskedastic to a fault and returns a hockey-stick shaped prediction reminiscent of nonlinear MARS models. Declaring multiple neighbors adds variance to the prediction, which improves the fit. Interestingly enough, the RMSE, \(R^2\), and MAE performance metrics indicate that the RF model with the most heteroskedasticity is the best fit for these data.
Fit different tree- and rule-based models for the Tecator data discussed below. How do they compare to linear models? Do the between-predictor correlations seem to affect your models? If so, how would you transform or re-encode the predictor data to mitigate this issue?
Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample's fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.
data(tecator)
set.seed(624)
rows_train <- createDataPartition(endpoints[, 3], p = 3/4, list= FALSE)
TEC_train_X <- as.data.frame(absorp[rows_train, ])
TEC_test_X <- as.data.frame(absorp[-rows_train, ])
TEC_train_Y <- endpoints[rows_train, 3]
TEC_test_Y <- endpoints[-rows_train, 3]
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
set.seed(624)
tune1 <- train(x = TEC_train_X, y = TEC_train_Y, method="pls",
trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 25)
The tecator data partitioning, controls, and Partial Least Squares (PLS) model are given in the question setup. The createDataPartition() function helps create a series of test and training partitions. The p parameter sets the percentage of data that goes to training. Declaring list=FALSE returns a matrix. The trainControl() function controls the computational nuances of the train() function. The repeatedcv resampling method and the repeats parameter specify the type of repeated \(k\)-fold cross-validation to be implemented. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. CART is implemented in the train() function using the rpart method, which tunes the model over the complexity parameter (right-sized tree based on RMSE). BAG is implemented in the train() function using the treebag method, which conducts bagging for classification, regression, and survival trees using the implementation in the rpart package. RF is implemented in the train() function using the rf method, which implements Random Forest classification and regression using the randomForest package. SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models; the hyperparameters declared in the tuning grid to be evaluated are various variable interaction depths (interaction.depth), numbers of fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode). CUBE is implemented in the train() function using the cubist method, which fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set; the hyperparameters declared in the tuning grid to be evaluated are the committees parameter, which specifies how many boosting iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of instances to use to correct the rule-based prediction. A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics.
set.seed(624)
tune2 <- train(x = TEC_train_X, y = TEC_train_Y,
method="rpart", trControl = ctrl, tuneLength = 25)
set.seed(624)
tune3<- train(x = TEC_train_X, y = TEC_train_Y,
method="treebag", trControl = ctrl)
set.seed(624)
tune4<- train(x = TEC_train_X, y = TEC_train_Y, method = "rf",
ntree = 1500, tuneLength = 10, trControl = ctrl)
set.seed(624)
tg5 <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50), shrinkage = c(0.01, 0.1), n.minobsinnode=10)
tune5 <- train(x = TEC_train_X, y = TEC_train_Y, method = "gbm",
verbose = FALSE, tuneGrid = tg5, trControl = ctrl)
set.seed(624)
tg6 <- expand.grid(committees = c(1:10, 20, 50, 75, 100), neighbors = c(0, 1, 5, 9))
tune6 <- train(x = TEC_train_X, y = TEC_train_Y, method = "cubist",
verbose = FALSE, tuneGrid = tg6, trControl = ctrl)
metrics <- function(tune) {
RMSE = min(tune$results$RMSE)
Rsquared = max(tune$results$Rsquared)
MAE = min(tune$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2), metrics(tune3),
metrics(tune4), metrics(tune5), metrics(tune6)),
row.names = c("PLS","CART","BAG","RF","SGB","CUBE"))
## RMSE Rsquared MAE
## PLS 0.5641773 0.9667829 0.4537333
## CART 2.6292768 0.3216989 2.0270089
## BAG 2.3087450 0.4360403 1.8714800
## RF 2.1175918 0.5259554 1.6867034
## SGB 1.9179625 0.6075514 1.4417302
## CUBE 0.5795218 0.9644043 0.4416561
fits <- list(PLS=tune1, CART=tune2, BAG=tune3, RF=tune4, SGB=tune5, CUBE=tune6)
bwplot(resamples(fits))
After fitting CART, BAG, RF, SGB, and CUBE models to the tecator data, which were previously found to be best suited for fitting with a linear PLS model, it appears that a CUBE model is also well suited for modeling the tecator data. The CUBE and PLS models outperform the other tree-based models in every resampling performance metric. The RMSE and \(R^2\) resampling performance metrics of the PLS model are superior to those of the CUBE model, however; the CUBE model only outperforms the PLS model in the MAE resampling performance metric. Based on these performance metrics, the linear PLS model still appears to provide the best fit for the tecator data (although the CUBE model comes in at a close second). In Exercise 8.2 it was found that predictor correlations do have an impact on CUBE models, so removing highly correlated variables using findCorrelation(cor(.)) may yield improvements that could make a CUBE model a better fit for the tecator data than a PLS model.
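A hedged sketch of that remedy (an added illustration, not run in the original; the 0.99 cutoff and the tune6b name are assumptions, chosen because the absorbance predictors are nearly collinear):
# Illustrative only: drop highly correlated absorbance columns, then refit the CUBE model.
keep <- setdiff(seq_len(ncol(TEC_train_X)), findCorrelation(cor(TEC_train_X), cutoff = 0.99))
set.seed(624)
tune6b <- train(x = TEC_train_X[, keep], y = TEC_train_Y, method = "cubist",
                verbose = FALSE, tuneGrid = tg6, trControl = ctrl)
min(tune6b$results$RMSE)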
Return to the permeability problem described below. Train several tree-based models and evaluate the resampling and test set performance.
Permeability is the measure of a molecule's ability to cross a membrane. Compounds that appear to be effective for a particular disease in research screening experiments but appear to be poorly permeable may need to be altered in order to improve permeability and thus the compound's ability to reach the desired target. Permeability assays such as PAMPA and Caco-2 have been developed to help measure compounds' permeability. These screens are effective at quantifying a compound's permeability, but the assay is expensive and labor intensive. Given a sufficient number of compounds that have been screened, we could develop a predictive model for permeability in an attempt to potentially reduce the need for the assay. In this project there were 165 unique compounds; 1,107 molecular fingerprints were determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure. The response is highly skewed, the predictors are sparse, and many predictors are strongly associated. Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.
library(AppliedPredictiveModeling)
data(permeability)
nzv <- nearZeroVar(fingerprints)
fingerprints <- fingerprints[,-nzv]
set.seed(624)
rows_train <- createDataPartition(permeability, p = 0.75, list = FALSE)
PER_train_X <- fingerprints[rows_train,]
PER_train_Y <- permeability[rows_train,]
PER_test_X <- fingerprints[-rows_train,]
PER_test_Y <- permeability[-rows_train,]
ctrl <- trainControl(method = "LGOCV")
set.seed(624)
tg1 <- expand.grid(lambda = c(0, 0.05, .1), fraction = seq(.05, 1, length = 25))
tune1 <- train(x = PER_train_X, y = log10(PER_train_Y),
method = "enet", tuneGrid = tg1, trControl = ctrl)
set.seed(624)
tg2 <- expand.grid(sigma = c(0.0005,0.001,0.0015), C = seq(1,49,by=6))
tune2 <- train(x = PER_train_X, y = log10(PER_train_Y),
method = "svmRadial", tuneGrid = tg2, trControl = ctrl)
Which tree-based model gives the optimal resampling and test set performance?
The permeability data partitioning, transformation, controls, Elastic Net Regression (ENET) model, and Support Vector Machines (SVM) model are given in the question setup. The createDataPartition() function helps create a series of test and training partitions. The p parameter sets the percentage of data that goes to training. Declaring list=FALSE returns a matrix. The trainControl() function controls the computational nuances of the train() function. The LGOCV resampling method specifies that leave-group-out cross-validation should be implemented. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. CART is implemented in the train() function using the rpart2 method, which tunes the model over the maximum depth (of the tree). RF is implemented in the train() function using the rf method, which implements Random Forest classification and regression using the randomForest package. SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models; the hyperparameters declared in the tuning grid to be evaluated are various variable interaction depths (interaction.depth), numbers of fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode). A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics. The predict() function applies a fitted model to data consisting of predictor variables. A data frame with the test set performance metrics is then displayed.
set.seed(624)
tg3 <- expand.grid(maxdepth= seq(1,10,by=1))
tune3 <- train(x = PER_train_X, y = log10(PER_train_Y),
method = "rpart2", tuneGrid = tg3, trControl = ctrl)
set.seed(624)
tune4 <- train(x = PER_train_X, y = log10(PER_train_Y),
method = "rf", tuneLength = 10, importance = TRUE, trControl = ctrl)
set.seed(624)
tg5 <- expand.grid(interaction.depth=seq(1,6,by=1), n.minobsinnode=10,
n.trees=c(25,50,100,200), shrinkage=c(0.01,0.05,0.1))
tune5 <- train(x = PER_train_X, y = log10(PER_train_Y),
method = "gbm", verbose = F, tuneGrid = tg5, trControl = ctrl)
metrics <- function(tune) {
RMSE = min(tune$results$RMSE)
Rsquared = max(tune$results$Rsquared)
MAE = min(tune$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune3), metrics(tune4),
metrics(tune5)), row.names = c("CART","RF","SGB"))
## RMSE Rsquared MAE
## CART 0.5154786 0.4591854 0.3802313
## RF 0.4869323 0.5249267 0.3657372
## SGB 0.4701520 0.5548857 0.3613686
fits <- list(CART=tune3, RF=tune4, SGB=tune5)
bwplot(resamples(fits))
fcast3 <- predict(tune3, newdata = PER_test_X)
fcast4 <- predict(tune4, newdata = PER_test_X)
fcast5 <- predict(tune5, newdata = PER_test_X)
data.frame(rbind(postResample(pred = fcast3, obs = PER_test_Y),
postResample(pred = fcast4, obs = PER_test_Y),
postResample(pred = fcast5, obs = PER_test_Y)),
row.names = c("CART","RF","SGB"))
## RMSE Rsquared MAE
## CART 18.75267 0.3466993 11.66874
## RF 18.69633 0.4892234 11.60125
## SGB 18.71166 0.4054850 11.59132
After fitting tree-based CART, RF, and SGB models to the permeability data, which were previously found to be best suited for fitting with a linear ENET or a nonlinear SVM model, it appears that the most suitable tree-based model for the permeability data is an SGB model. The SGB model outperforms the other tree-based models in every resampling performance metric. The RF model does outperform the other tree-based models on the test set, but these test set metrics are less important to predictive ability than the resampling performance metrics.
Do any of these models outperform the covariance or non-covariance based regression models you have previously developed for these data? What criteria did you use to compare models’ performance?
A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics. A data frame with the test set performance metrics is also displayed.
metrics <- function(tune) {
RMSE = min(tune$results$RMSE)
Rsquared = max(tune$results$Rsquared)
MAE = min(tune$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2),
metrics(tune3), metrics(tune4), metrics(tune5)),
row.names = c("ENET","SVM","CART","RF","SGB"))
## RMSE Rsquared MAE
## ENET 0.4673019 0.5684217 0.3533567
## SVM 0.5118277 0.4741713 0.3859506
## CART 0.5154786 0.4591854 0.3802313
## RF 0.4869323 0.5249267 0.3657372
## SGB 0.4701520 0.5548857 0.3613686
fits <- list(ENET=tune1, SVM=tune2, CART=tune3, RF=tune4, SGB=tune5)
bwplot(resamples(fits))
fcast1 <- predict(tune1, newdata = PER_test_X)
fcast2 <- predict(tune2, newdata = PER_test_X)
fcast3 <- predict(tune3, newdata = PER_test_X)
fcast4 <- predict(tune4, newdata = PER_test_X)
fcast5 <- predict(tune5, newdata = PER_test_X)
data.frame(rbind(postResample(pred = fcast1, obs = PER_test_Y),
postResample(pred = fcast2, obs = PER_test_Y),
postResample(pred = fcast3, obs = PER_test_Y),
postResample(pred = fcast4, obs = PER_test_Y),
postResample(pred = fcast5, obs = PER_test_Y)),
row.names = c("ENET","SVM","CART","RF","SGB"))
## RMSE Rsquared MAE
## ENET 18.71909 0.4051683 11.56402
## SVM 18.65525 0.3940251 11.50458
## CART 18.75267 0.3466993 11.66874
## RF 18.69633 0.4892234 11.60125
## SGB 18.71166 0.4054850 11.59132
After fitting tree-based CART, RF, and SGB models to the permeability data, which were previously found to be best suited for fitting with a covariance (linear) ENET or non-covariance (nonlinear) SVM model, it appears that the most suitable model for the permeability data is still the ENET model. The ENET model outperforms the SVM and tree-based models in every resampling performance metric. Other models outperform the ENET model on the test set, but these test set metrics are less important to predictive ability than the resampling performance metrics.
Of all the models you have developed thus far, which, if any, would you recommend to replace the permeability laboratory experiment?
The train() function, once computed and stored to a variable, contains several attributes which help evaluate model performance.
tune1$modelInfo$label
data.frame(model="ENET", tune1$bestTune, RMSE=min(tune1$results$RMSE), row.names="")
plot(tune1)
## [1] "Elasticnet"
## model fraction lambda RMSE
## ENET 0.1291667 0.1 0.4673019
tune3$modelInfo$label
data.frame(model="CART", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
plot(tune3)
## [1] "CART"
## model maxdepth RMSE
## CART 3 0.5154786
Using the resampling performance metrics as a guide in choosing the recommendation of a replacement for the permeability laboratory experiment, it would be advisable to opt for the ENET model, since the ENET model performs better than the SVM and tree-based models. If, however, a more interpretable model is necessary, the CART model would be a good choice. Its resampling performance is relatively close to that of the ENET model and it is more interpretable by non-technical persons.
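As a brief illustration of that interpretability (an added sketch; it assumes the tune3 CART fit from above), the final rpart tree stored by train() can be printed or plotted directly:
# Illustrative only: inspect the final pruned CART tree and its split rules.
tune3$finalModel
plot(tune3$finalModel); text(tune3$finalModel)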
Refer to Exercise 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
CMP <- get(data(ChemicalManufacturingProcess))
rm(ChemicalManufacturingProcess) # list not needed
set.seed(624)
rows_train <- createDataPartition(CMP$Yield, p=0.75, list=F)
CMP_train <- CMP[rows_train, ]
CMP_test <- CMP[-rows_train, ]
prepro <- preProcess(subset(CMP_train, select = - Yield),
method=c("BoxCox", "center", "scale", "knnImpute"))
CMP_train_X <- predict(prepro, CMP_train[ , -1])
nzv <- nearZeroVar(CMP_train_X)
mcl <- findCorrelation(cor(CMP_train_X))
CMP_train_X <- CMP_train_X[ ,-c(nzv, mcl)]
set.seed(624)
ctrl1 <- trainControl(method = "boot", number = 25)
tune1 <- train(x = CMP_train_X, y = CMP_train$Yield,
method="pls", tuneLength=15, trControl=ctrl1)
set.seed(624)
ctrl2 <- trainControl(method = "cv")
tg2 <- data.frame(.k = 1:20)
tune2 <- train(x = CMP_train_X, y = CMP_train$Yield,
method = "knn", tuneGrid = tg2, trControl = ctrl2)
CART is implemented in the train() function using the rpart2 method, which tunes the model over the maximum depth of the tree.
set.seed(624)
ctrl3 <- trainControl(method = "boot", number = 25)
tg3 <- expand.grid(maxdepth= seq(1,10,by=1))
tune3 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rpart2",
metric = "Rsquared", tuneGrid = tg3, trControl = ctrl3)
CIT is implemented in the train() function using the ctree2 method, which tunes the model over the maximum depth of the tree (maxdepth) and over mincriterion, the value of one minus the \(p\)-value that must be exceeded for a split to be implemented. The tuning grid evaluates depths of 1 through 10 and \(p\)-value thresholds of 0.01, 0.05, and 0.10.
set.seed(624)
ctrl4 <- trainControl(method = "cv")
tg4 <- expand.grid(maxdepth= seq(1,10,by=1), mincriterion=1-c(0.01, 0.05, 0.1))
tune4 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "ctree2",
metric = "Rsquared", tuneGrid = tg4, trControl = ctrl4)
M5 is implemented in the train() function using the M5 method from the RWeka package. The pruned, smoothed, and rules hyperparameters evaluate models with and without pruning, smoothing, and a rule-based representation.
set.seed(624)
yn <- c("Yes","No")
ctrl5 <- trainControl(method = "cv")
tg5 <- expand.grid(pruned = yn, smoothed = yn, rules = yn)
tune5 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "M5",
metric = "Rsquared", tuneGrid = tg5, trControl = ctrl5)
BAG is implemented in the train() function using the bag method, which performs bagging for classification and regression models using the implementation in the caret package. The bagControl() function controls the computational nuances of this train() method: its fit, predict, and aggregate arguments each take a function that provides the framework for bagging a model, and the tree-based ctreeBag functions are supplied here. The hyperparameter declared in the tuning grid is vars, the number of randomly selected predictors made available to each bagged model.
n <- ncol(CMP_train_X) # number of predictors remaining after filtering
seq_len(abs(n))[n %% seq_len(abs(n)) == 0L] # integer divisors of n
## [1] 1 2 3 4 6 8 12 16 24 48
set.seed(624)
bag6 <- bagControl(fit=ctreeBag$fit, predict=ctreeBag$pred, aggregate=ctreeBag$aggregate)
ctrl6 <-trainControl(method = "cv")
tg6 <- data.frame(vars = seq(2, n, 11))
tune6 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "bag", B = 25,
metric = "Rsquared", tuneGrid = tg6, trControl = ctrl6, bagControl = bag6)
RF is implemented in the train()
function using the rf
method which implements Random Forest classification and regression using the randomForest
package.
set.seed(624)
ctrl7 <- trainControl(method = "boot", number = 25)
tg7 <- expand.grid(mtry=seq(2,38,by=3))
tune7 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rf",
metric = "Rsquared", tuneGrid = tg7, trControl = ctrl7)
SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The hyperparameters declared in the tuning grid are the variable interaction depth (interaction.depth), the number of fitted trees (n.trees), the learning rate (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode).
set.seed(624)
ctrl8 <- trainControl(method = "boot", number = 25)
tg8 <- expand.grid(interaction.depth=seq(1,6,by=1), n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2), n.minobsinnode=10)
tune8 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "gbm",
metric = "Rsquared", tuneGrid = tg8, trControl = ctrl8, verbose=F)
CUBE is implemented in the train() function using the cubist method, which fits a rule-based model (similar to M5) with additional corrections based on nearest neighbors in the training set. The hyperparameters declared in the tuning grid are the committees parameter, which specifies how many boosting iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of training set instances used to correct the rule-based prediction.
set.seed(624)
ctrl9 <- trainControl(method = "boot", number = 25)
tg9 <- expand.grid(committees = c(1,5,10,20,50,100), neighbors = c(0,1,3,5,7))
tune9 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "cubist",
metric = "Rsquared", tuneGrid = tg9, trControl = ctrl9)
Which tree-based regression model gives the optimal resampling and test set performance?
The ChemicalManufacturingProcess data partitioning, pre-processing, controls, PLS model, and KNN model are given in the question setup. The train() function, once computed and stored to a variable, contains several attributes that help evaluate model performance, and a helper function is created to display the resampling performance metrics as a data frame. A slight issue arises during forecasting because a value of \(-\infty\) remains in one column of the pre-processed test set. This infinite value hinders forecasting with some of the models, so, for the sake of completeness in model comparison, it is replaced with the minimum finite value of the variable to which it belongs. The predict() function applies a fitted model to data consisting of predictor variables. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics. A data frame with the test set performance metrics is then displayed.
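The replacement described above is applied to a single, known column in the test-set code further below. As a more general alternative, the sketch below (using a hypothetical helper named fix_neg_inf, not part of the original analysis) would replace any \(-\infty\) in any numeric column of a data frame with that column's minimum finite value.
fix_neg_inf <- function(df) { # hypothetical helper; assumes all columns of df are numeric
for (j in seq_along(df)) {
bad <- is.infinite(df[[j]]) & df[[j]] < 0 # locate any -Inf entries in column j
if (any(bad)) df[[j]][bad] <- min(df[[j]][is.finite(df[[j]])]) } # replace with the column's minimum finite value
df }
# e.g., CMP_test_X <- fix_neg_inf(CMP_test_X)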
tune3$modelInfo$label
data.frame(model="CART", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
plot(tune3)
## [1] "CART"
## model maxdepth RMSE
## CART 8 1.5226
tune4$modelInfo$label
data.frame(model="CIT", tune4$bestTune, RMSE=min(tune4$results$RMSE), row.names="")
plot(tune4)
## [1] "Conditional Inference Tree"
## model maxdepth mincriterion RMSE
## CIT 1 0.9 1.420923
tune5$modelInfo$label
data.frame(model="M5", tune5$bestTune, RMSE=min(tune5$results$RMSE), row.names="")
plot(tune5)
## [1] "Model Tree"
## model pruned smoothed rules RMSE
## M5 No Yes Yes 1.44276
tune6$modelInfo$label
data.frame(model="BAG", tune5$bestTune, RMSE=min(tune6$results$RMSE), row.names="")
plot(tune6)
## [1] "Bagged Model"
## model RMSE
## BAG 1.266828
tune7$modelInfo$label
data.frame(model="RF", tune7$bestTune, RMSE=min(tune7$results$RMSE), row.names="")
plot(tune7)
## [1] "Random Forest"
## model mtry RMSE
## RF 11 1.249657
tune8$modelInfo$label
data.frame(model="SGB", tune8$bestTune, RMSE=min(tune8$results$RMSE), row.names="")
plot(tune8)
## [1] "Stochastic Gradient Boosting"
## model n.trees interaction.depth shrinkage n.minobsinnode RMSE
## SGB 200 6 0.05 10 1.263188
tune9$modelInfo$label
data.frame(model="CUBE", tune9$bestTune, RMSE=min(tune9$results$RMSE), row.names="")
plot(tune9)
## [1] "Cubist"
## model committees neighbors RMSE
## CUBE 100 1 1.125089
metrics <- function(tune) {
RMSE = min(tune$results$RMSE)
Rsquared = max(tune$results$Rsquared)
MAE = min(tune$results$MAE)
return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2),
metrics(tune3), metrics(tune4), metrics(tune5),
metrics(tune6), metrics(tune7), metrics(tune8), metrics(tune9)),
row.names = c("PLS","KNN","CART","CIT","M5","BAG","RF","SGB","CUBE"))
## RMSE Rsquared MAE
## PLS 1.276354 0.5037055 1.0554596
## KNN 1.263295 0.5285458 1.0324313
## CART 1.522600 0.3430106 1.1958832
## CIT 1.420923 0.4054332 1.1562190
## M5 1.442760 0.4578988 1.1418822
## BAG 1.266828 0.5100838 1.0055814
## RF 1.249657 0.5314406 0.9782162
## SGB 1.263188 0.5076208 0.9857264
## CUBE 1.125089 0.6074287 0.8624796
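For a visual comparison analogous to the bwplot() used for the permeability models above, the resampling distributions of these models can also be plotted. Because resamples() requires a common resampling scheme, the sketch below (not part of the original output) compares only the models tuned with 25 bootstrap resamples.
boot_fits <- list(PLS=tune1, CART=tune3, RF=tune7, SGB=tune8, CUBE=tune9) # all tuned with method = "boot", number = 25
bwplot(resamples(boot_fits))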
CMP_test_X <- predict(prepro, CMP_test[ , -1])
CMP_test_X <- CMP_test_X[ ,-c(nzv, mcl)]
val <- min(CMP_test_X[is.finite(CMP_test_X[,35]), 35]) # minimum finite value of the affected column
CMP_test_X[CMP_test_X[,35] == -Inf, 35] <- val # replace the remaining -Inf with that minimum
fcast1 <- predict(tune1, newdata = CMP_test_X)
fcast2 <- predict(tune2, newdata = CMP_test_X)
fcast3 <- predict(tune3, newdata = CMP_test_X)
fcast4 <- predict(tune4, newdata = CMP_test_X)
fcast5 <- predict(tune5, newdata = CMP_test_X)
fcast6 <- predict(tune6, newdata = CMP_test_X)
fcast7 <- predict(tune7, newdata = CMP_test_X)
fcast8 <- predict(tune8, newdata = CMP_test_X)
fcast9 <- predict(tune9, newdata = CMP_test_X)
data.frame(rbind(postResample(pred = fcast1, obs = CMP_test$Yield),
postResample(pred = fcast2, obs = CMP_test$Yield),
postResample(pred = fcast3, obs = CMP_test$Yield),
postResample(pred = fcast4, obs = CMP_test$Yield),
postResample(pred = fcast5, obs = CMP_test$Yield),
postResample(pred = fcast6, obs = CMP_test$Yield),
postResample(pred = fcast7, obs = CMP_test$Yield),
postResample(pred = fcast8, obs = CMP_test$Yield),
postResample(pred = fcast9, obs = CMP_test$Yield)),
row.names = c("PLS","KNN","CART","CIT","M5","BAG","RF","SGB","CUBE"))
## RMSE Rsquared MAE
## PLS 1.761001 0.3990817 1.1104420
## KNN 1.368251 0.5493334 1.0565909
## CART 1.749335 0.2936150 1.3255474
## CIT 1.585968 0.3970426 1.2146354
## M5 2.050190 0.2781859 1.3634098
## BAG 1.357619 0.5735154 1.0251716
## RF 1.372159 0.5694778 0.9933338
## SGB 1.189012 0.6715511 0.8399855
## CUBE 1.603838 0.5225231 0.8327512
After fitting tree-based CART, CIT, M5, BAG, RF, SGB, and CUBE models to the ChemicalManufacturingProcess data, which were previously found to be best fit by a linear PLS or nonlinear KNN model, it appears that the most suitable model for the ChemicalManufacturingProcess data is the CUBE model. The CUBE model with 100 committees and 1 neighbor outperforms the other models in every resampling performance metric. Several models outperform the CUBE model on the test set, with SGB performing best, but the test set metrics carry less weight in judging predictive ability than the resampling performance metrics.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The dotPlot()
function creates a dotplot of variable importance values. The varImp()
function calculates the variable importance for a given model.
dotPlot(varImp(tune6), top=15)
varimp1 <- varImp(tune9)$importance
varimp2 <- varImp(tune1)$importance
varimp3 <- varImp(tune2)$importance
colnames(varimp1) <- "CUBE"
colnames(varimp2) <- "PLS"
colnames(varimp3) <- "KNN"
imp <- data.frame(varimp1, varimp2, varimp3)
head(imp[order(imp$CUBE, decreasing = T), ], 10)
## CUBE PLS KNN
## ManufacturingProcess32 100 50.30322 35.84409
## BiologicalMaterial03 50 54.80568 49.52172
## BiologicalMaterial06 50 52.82129 42.61298
## ManufacturingProcess09 50 25.67340 14.77863
## ManufacturingProcess25 50 61.51722 74.50577
## ManufacturingProcess29 50 52.70012 27.90151
## BiologicalMaterial01 0 18.94683 17.86572
## BiologicalMaterial04 0 42.67018 16.62820
## BiologicalMaterial05 0 47.86206 30.32483
## BiologicalMaterial08 0 17.20016 11.17837
The CUBE model, which was found to be the optimal tree-based regression model, ranks the ManufacturingProcess32 process variable as its most important predictor by a wide margin. That variable also receives meaningful importance from the PLS model (the optimal linear model) and the KNN model (the optimal nonlinear model), although both of those models rank ManufacturingProcess25 higher among the predictors shown. Process variables dominate the CUBE list: four of the six predictors with nonzero importance are process variables and two are biological materials. All three models agree that BiologicalMaterial03 is an important biological predictor, but the CUBE model ranks it equal to BiologicalMaterial06, whereas the PLS and KNN models place BiologicalMaterial03 above BiologicalMaterial06. Another notable difference is that the CUBE model assigns nonzero importance to only six predictors, one of which (ManufacturingProcess09) is ranked considerably lower by the PLS and KNN models.
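The dominance of the process variables noted above can be verified directly from the imp data frame constructed earlier; the short sketch below (not part of the original output) tallies the biological and process predictors among those with nonzero CUBE importance.
nonzero <- rownames(imp)[imp$CUBE > 0] # predictors the Cubist model assigns nonzero importance
table(ifelse(grepl("^Biological", nonzero), "Biological", "Process")) # count each predictor type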
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Decision trees for single tree models can be easily plotted using the base R plot()
function with the finalModel
attribute of the tuned model. Plotting decision trees from CIT models presents no issues, but plotting decision trees from CART models sometimes results in blank nodes. Wrapping the finalModel
attribute of the tuned CART model in the as.party()
function from the partykit
library resolves these CART model decision tree plotting issues. To avoid overlapping text that renders the tree illegible, the font size is reduced using the gp=gpar() argument.
plot(tune4$finalModel)
plot(as.party(tune3$finalModel), gp=gpar(fontsize = 8))
The optimal single tree model based on RMSE, \(R^2\), and MAE resampling performance metrics is the CIT model. The decision tree produced for the CIT model shows a single split for ManufacturingProcess32
at the 0.111 value. Values lower than 0.111 are associated with lower yields as represented by the boxplot distributions. Values over 0.111 are associated with higher yields as represented by the boxplot distributions. The visualization of the tree does help in understanding the relationship between ManufacturingProcess32
and the yield, but not between any other predictor and the yield since there is only one split. The CART model tree is plotted for comparison. This tree has several branches and details the same information about ManufacturingProcess32
, albeit at a different cutoff value of 0.2, but also provides a much greater understanding of the relationship between the biological and process predictors and the yield due to the additional branches.
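As a text-based complement to the plots, the fitted single-tree objects can simply be printed; this sketch (not part of the original output) lists each tree's split conditions and node summaries, which makes the cutoff values discussed above easy to read off.
print(tune4$finalModel) # CIT: prints the conditional inference tree and its split
print(tune3$finalModel) # CART: prints each node's split, sample count, deviance, and predicted yield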
https://rpubs.com/josezuniga/366918
https://rpubs.com/josezuniga/376348
https://rpubs.com/josezuniga/377753
http://appliedpredictivemodeling.com/
https://github.com/topepo/APM_Exercises
https://www.ncbi.nlm.nih.gov/pubmed/18620558
https://cran.r-project.org/web/packages/gbm/gbm.pdf
https://cran.r-project.org/web/packages/caret/caret.pdf
https://cran.r-project.org/web/packages/RWeka/RWeka.pdf
https://topepo.github.io/caret/train-models-by-tag.html#Bagging
https://github.com/topepo/caret/blob/master/RegressionTests/Code/M5.R
https://github.com/topepo/caret/blob/master/RegressionTests/Code/bag.R
http://appliedpredictivemodeling.com/blog/2014/11/12/solutions-on-github
https://github.com/topepo/caret/blob/master/RegressionTests/Code/ctree2.R
http://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/train
https://github.com/topepo/caret/blob/master/RegressionTests/Code/cforest.R
https://www.rdocumentation.org/packages/party/versions/1.2-4/topics/varimp
https://www.rdocumentation.org/packages/caret/versions/6.0-78/topics/varImp
https://cran.r-project.org/web/packages/HSAUR2/vignettes/Ch_recursive_partitioning.pdf
https://stackoverflow.com/questions/13751962/how-to-plot-a-large-ctree-to-avoid-overlapping-nodes