if (!require("gbm")) install.packages("gbm")
if (!require("pls")) install.packages("pls")
if (!require("caret")) install.packages("caret")
if (!require("ipred")) install.packages("ipred")
if (!require("party")) install.packages("party")
if (!require("rpart")) install.packages("rpart")
if (!require("RWeka")) install.packages("RWeka")
if (!require("Cubist")) install.packages("Cubist")
if (!require("mlbench")) install.packages("mlbench")
if (!require("partykit")) install.packages("partykit")
if (!require("elasticnet")) install.packages("elasticnet")
if (!require("randomForest")) install.packages("randomForest")
if (!require("AppliedPredictiveModeling")) install.packages("AppliedPredictiveModeling")
library(gbm)
library(pls)
library(caret)
library(ipred)
library(party)
library(rpart)
library(RWeka)
library(Cubist)
library(mlbench)
library(partykit)
library(elasticnet)
library(randomForest)
library(AppliedPredictiveModeling)

Note: Model descriptions are from Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Solutions are modifications of those posted by Max Kuhn on his public GitHub page. Function descriptions are from the RDocumentation website.

Classification and Regression Tree (CART)

Single Decision Tree. Tree-based models generate a set of conditions that are highly interpretable and easy to implement. They can effectively handle many types of predictors (sparse, skewed, continuous, categorical, etc.) without the need to pre-process them (imputation, feature selection, etc.). In addition, these models do not require the user to specify the form of the predictors' relationship to the response as, for example, a linear regression model does. There are many techniques for constructing regression trees. One of the oldest and most utilized is the Classification and Regression Tree (CART) methodology, also known as recursive partitioning. For regression, the model begins with the entire data set and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups such that the overall sum of squared errors (SSE) is minimized. Then, within each of [the] groups, this method searches for the predictor and split value that best reduce the SSE until the number of samples in the splits falls below some threshold. The fully-grown tree is then pruned back to a potentially smaller depth [to mitigate over-fitting]. If a predictor is never used in a split, the prediction equation is independent of those data.
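
To make the recursive partitioning and pruning steps concrete, a minimal sketch using the rpart package is shown below (the mtcars data set and the specific cp value are illustrative choices, not tuned values):

# Fit a CART regression tree: each split minimizes SSE within the resulting groups
library(rpart)
cart_fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
printcp(cart_fit)                           # complexity parameter table used to size the tree
cart_pruned <- prune(cart_fit, cp = 0.05)   # prune the fully grown tree back
predict(cart_pruned, newdata = mtcars[1:3, ])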

Conditional Inference Tree (CIT)

Single Decision Tree. Conditional Inference Trees (CIT) describe a unified framework for unbiased tree-based models for regression, classification, and other scenarios. Statistical hypothesis tests are used to do an exhaustive search across the predictors and their possible split points. For a candidate split, a statistical test is used to evaluate the difference between the means of the two groups created by the split and a \(p\)-value can be computed for the test. Multiple comparison corrections reduce the number of false-positive test results that are incurred by conducting [several] statistical hypothesis tests. This reduces bias for highly granular data. This algorithm does not use pruning. As data sets are split, the decrease in the number of samples reduces the power of the hypothesis tests. This results in higher \(p\)-values and a lower likelihood of a new split (and over-fitting). Statistical hypothesis tests are not directly related to predictive performance and it is therefore still advisable to choose the complexity of the tree on the basis of performance (via resampling or some other means).
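
A minimal conditional inference tree sketch with the party package follows (mtcars and the 0.95 mincriterion, i.e. \(1-\alpha\), are illustrative choices):

# Splits are chosen via permutation-based hypothesis tests; growth stops when no
# split is statistically significant, so no pruning step is needed
library(party)
cit_fit <- party::ctree(mpg ~ ., data = mtcars,
  controls = party::ctree_control(mincriterion = 0.95))
predict(cit_fit, newdata = mtcars[1:3, ])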

Regression Model Tree (M5)

Single Decision Tree. In Simple Regression Trees, each terminal node uses the average of the training set outcomes in that node for prediction and therefore may not do a good job predicting samples whose true outcomes are extremely high or low. The Regression Model Tree approach implemented with the M5 algorithm uses a different splitting criterion, predicts the outcome using a linear model, and results in a model that is often a combination of the predictions from different models along the same path through the tree. Like Simple Regression Trees, the initial split is found using an exhaustive search over the predictors and training set samples, but, unlike those models, the expected reduction in the node's error rate is used. Model Trees also incorporate a type of smoothing to decrease the potential for over-fitting. This smoothing can have a significant positive effect on the Model Tree when the linear models across nodes are very different.
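
A minimal model tree sketch using the M5 implementation in the RWeka package is shown below (mtcars is used purely for illustration; RWeka requires a working Java installation):

# M5P grows a tree with a linear model in each terminal node and applies smoothing
library(RWeka)
m5_fit <- M5P(mpg ~ ., data = mtcars)
m5_fit                                      # printing shows the tree with a linear model in each leaf
predict(m5_fit, newdata = mtcars[1:3, ])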

Bootstrap Aggregation (BAG)

Ensemble Method. Models based on single trees have particular weaknesses: (1) model instability (i.e., slight changes in the data can drastically change the structure of the tree or rules and, hence, the interpretation) and (2) less-than-optimal predictive performance. To combat these problems, researchers developed ensemble methods that combine many trees into one model. In the 1990s, ensemble techniques (methods that combine many models’ predictions) began to appear. Bagging (BAG), short for bootstrap aggregation, was one of the earliest developed ensemble techniques. Bagging is a general approach that uses bootstrapping in conjunction with any regression (or classification) model to construct an ensemble. Each model in the ensemble is used to generate a prediction for a new sample and these \(m\) predictions are averaged to give the bagged model’s prediction. For models that produce an unstable prediction, like regression trees, aggregating over many versions of the training data reduces the variance in the prediction and, hence, makes the prediction more stable. Bagging stable, lower variance models like linear regression and MARS offers less improvement in predictive performance. Although Bagging usually improves predictive performance for unstable models, results are less interpretable and computational costs are higher. CART (or conditional inference trees) can be used as the base learner in Bagging.
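
A minimal bagging sketch with the ipred package follows (mtcars and nbagg = 25 are illustrative choices):

# Fit rpart trees to 25 bootstrap samples and average their predictions
library(ipred)
bag_fit <- bagging(mpg ~ ., data = mtcars, nbagg = 25, coob = TRUE)
bag_fit$err                                 # out-of-bag estimate of prediction error
predict(bag_fit, newdata = mtcars[1:3, ])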

Random Forest (RF)

Ensemble Method. Generating bootstrap samples introduces a random component into the tree building process, which induces a distribution of trees and therefore also a distribution of predicted values for each sample. The trees in Bagging, however, are not completely independent of each other since all the original predictors are considered at every split of every tree. This characteristic is known as tree correlation and prevents Bagging from optimally reducing the variance of the predicted values. Random Forests (RF) is a unified model that [de-correlates trees by] adding randomness to the tree construction (learning) process [by implementing] random split selection[, building] entire trees based on random subsets of descriptors[, and] adding noise to the response in order to perturb tree structure. This ensemble of many independent, strong learners yields an improvement in error rates that is robust to a noisy response. At the same time, the independence of learners can underfit data when the response is not noisy. Compared to Bagging, Random Forests is more computationally efficient on a tree-by-tree basis. The ensemble nature of Random Forests makes it impossible to gain an understanding of the relationship between the predictors and the response. CART (or conditional inference trees) can be used as the base learner in Random Forests.
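
A minimal random forest sketch with the randomForest package follows (mtcars, ntree = 500, and mtry = 3 are illustrative choices):

# mtry predictors are randomly selected as split candidates at each node, de-correlating the trees
library(randomForest)
rf_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = 3, importance = TRUE)
importance(rf_fit)                          # permutation-based variable importance
predict(rf_fit, newdata = mtcars[1:3, ])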

Stochastic Gradient Boosting (SGB)

Ensemble Method. Boosting models were originally developed for classification problems and were later extended to the regression setting. Boosting algorithms (gradient boosting machines) are influenced by learning theory [where] weak classifiers that predict marginally better than random are combined (or boosted) to produce an ensemble classifier with a superior generalized misclassification error rate. The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function. The algorithm is typically initialized with the best guess of the response (e.g., the mean of the response in regression), the gradient (e.g., residual) is then calculated, and a model is then fit to the residuals to minimize the loss function. The current model is added to the previous model, and the procedure continues for a user-specified number of iterations. To mitigate greediness (choosing the optimal weak learner at each stage), a constraint is built into the learning process by employing regularization, or shrinkage [where] only a fraction of the current predicted value (commonly referred to as the learning rate) is added to the previous iteration’s predicted value. Stochastic Gradient Boosting (SGB) implements all of the aforementioned using random sampling during selection of the training data. This [final] modification improves the prediction accuracy of Boosting while also reducing the demand for computational resources.
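
A minimal stochastic gradient boosting sketch with the gbm package follows (the BostonHousing data from mlbench and the specific parameter values are illustrative choices):

# Each new tree is fit to the current residuals; shrinkage is the learning rate and
# bag.fraction is the random fraction of the training data used at each iteration
library(gbm)
data(BostonHousing, package = "mlbench")
sgb_fit <- gbm(medv ~ ., data = BostonHousing, distribution = "gaussian",
  n.trees = 500, interaction.depth = 3, shrinkage = 0.05, bag.fraction = 0.5)
predict(sgb_fit, newdata = BostonHousing[1:3, ], n.trees = 500)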

Rule-Based Cubist (CUBE)

Single Decision Tree or Ensemble Method. A rule is defined as a distinct path through a tree. The number of samples affected by a rule is called its coverage. A rule-based model starts by creating an initial [unsmoothed] model tree [and then saving] only the rule with the largest coverage. The samples covered by the rule are removed from the training set and another model tree is created with the remaining data. Again, only the rule with the maximum coverage is retained. This process repeats until all the training set data have been covered by at least one rule. A new sample is then predicted by determining which rule(s) it falls under, and the linear model associated with the largest coverage is applied. Pruning and smoothing must be applied judiciously since the impact of pruning on rule-based models is large and smoothing has a larger impact on unpruned models. Cubist (CUBE) is a rule-based model that is an amalgamation of several methodologies. Cubist was previously only available in a commercial capacity, but in 2011 the source code was released under an open-source license. Cubist differs from previously described tree-based models and their rule-based variants in the techniques it uses for linear model smoothing, rule creation, and pruning. Cubist also employs an optional Boosting-like [ensemble] procedure called committees, and predictions generated by the model rules can be adjusted using nearby points (neighbors) from the training set data. These nearest neighbors are determined using Manhattan (a.k.a. city block) distances, and neighbors are only included if they are [within the average distance] of the prediction sample.
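
A minimal Cubist sketch follows (BostonHousing from mlbench, 10 committees, and 5 neighbors are illustrative choices):

# Cubist uses a separate x/y interface; committees add the boosting-like ensemble and
# neighbors adjust predictions using nearby training set points at prediction time
library(Cubist)
data(BostonHousing, package = "mlbench")
X <- BostonHousing[, names(BostonHousing) != "medv"]
cube_fit <- cubist(x = X, y = BostonHousing$medv, committees = 10)
summary(cube_fit)                           # prints the rules and their associated linear models
predict(cube_fit, newdata = X[1:3, ], neighbors = 5)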

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(624)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

Exercise 8.1.a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

set.seed(624)
model1 <- randomForest(y ~ ., data=simulated, importance=TRUE, ntree=1000)
varimp1 <- varImp(model1)

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

Approach

Methods for fitting the random forest model and the variable importance score calculations are given in the question setup. The function randomForest() is used for fitting the model. The parameter values importance=TRUE and ntree=1000 dictate that the importance of predictors should be assessed and that the model should grow 1000 trees. The varImp() function calculates the variable importance for the model.

Results

varimp1
##        Overall
## V1  52.1858851
## V2  40.4759963
## V3  12.1019585
## V4  61.0955614
## V5  23.6547291
## V6   1.1061162
## V7  -2.3741595
## V8   0.1635707
## V9  -1.0509349
## V10  1.9515464

Interpretation

Per Exercise 7.2, Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: \(y=10\sin { \left( \pi x_{ 1 }x_{ 2 } \right) } +20\left( x_{ 3 }-0.5 \right) ^{ 2 }+10x_{ 4 }+5x_{ 5 }+N(0,\sigma ^{ 2 })\) where the \(x\) values are random variables uniformly distributed on \([0,1]\) (there are also five other non-informative variables created in the simulation). Applying the Random Forest model to these data and then calculating the variable importance reveals that the Random Forest model favors the informative predictors (V1 - V5) and does not significantly use the uninformative predictors (V6 - V10).
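
To make the simulation explicit, the sketch below generates the response by hand from the same equation; this mirrors what mlbench.friedman1() does internally and is shown only for illustration:

# Hand-rolled Friedman (1991) benchmark: only V1-V5 drive the response; V6-V10 are noise
set.seed(624)
n <- 200
X <- matrix(runif(n * 10), ncol = 10)              # ten Uniform(0, 1) predictors
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
  10 * X[, 4] + 5 * X[, 5] + rnorm(n, sd = 1)      # N(0, sigma^2) error with sigma = 1
head(cbind(X[, 1:5], y))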

Exercise 8.1.b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(624)
duplicated <- simulated
duplicated$V11 <- duplicated$V1 + rnorm(200) * 0.1
cor(duplicated$V11, duplicated$V1)
## [1] 0.9466669

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

Approach

Methods for adding an additional predictor that is highly correlated with one of the informative predictors are given in the question setup. This is done using the rnorm() function, which produces pseudo-random Gaussian numbers. Specifically, the question setup adds a pseudo-random Gaussian number multiplied by 0.1 to each element in the V1 vector to create the V11 vector. The correlation between V1 and V11 is then confirmed with the base R cor() function. This new dataset from the question setup is fitted with a Random Forest model using the same randomForest() function. Again, the parameter values importance=TRUE and ntree=1000 dictate that the importance of predictors should be assessed and that the model should grow 1000 trees. The varImp() function then calculates the variable importance for the new model. The variable importance calculations for both models are then output in a data frame for comparison. To create the data frame, the first model needs to be padded with a V11 predictor that has a variable importance value of NA.

Results

set.seed(624)
model2 <- randomForest(y ~ ., data=duplicated, importance=TRUE, ntree=1000)
varimp1 <- varImp(model1)
varimp2 <- varImp(model2)
colnames(varimp1) <- "Simulated" 
colnames(varimp2) <- "Correlated"
padding <- data.frame(Simulated=NA, row.names="V11")
data.frame(rbind(varimp1, padding), varimp2)
##      Simulated Correlated
## V1  52.1858851 32.0082853
## V2  40.4759963 39.7176368
## V3  12.1019585 11.2705227
## V4  61.0955614 58.6501578
## V5  23.6547291 24.0180835
## V6   1.1061162  1.1338830
## V7  -2.3741595 -1.7153863
## V8   0.1635707  0.4555065
## V9  -1.0509349 -1.0899257
## V10  1.9515464  1.2979827
## V11         NA 26.4040559

Interpretation

Adding the predictor V11 that is highly correlated with the most informative predictor V1 does change the variable importance score of V1 as well as the importance scores of all the other informative and uninformative variables under the Random Forest (RF) model. The importance of this highly correlated V11 predictor even outranks two of the five informative variables. This exercise highlights a weakness in tree-based models. When variables are highly correlated, tree-based models have a tendency to select more predictors than are actually needed with the choice between highly correlated predictors being nearly random.

Exercise 8.1.c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Approach

The cforest() implementation of the random forest (and bagging) algorithm differs from randomForest() with respect to the base learners used and the aggregation scheme applied. Conditional Inference Trees are fitted to each of the ntree bootstrap samples (500 by default, or as defined via cforest_control) of the learning sample. The cforest_control() hyperparameter mtry, which dictates the number of randomly preselected variables, is set to \(P-1\), where \(P\) is the number of predictors. After the cforest() model is fit, there are two functions that can calculate the importance of the variables. The varimp() function, spelled all in lowercase, from the party package calculates the standard and conditional variable importance for cforest() models following the permutation principle of the "mean decrease in accuracy" importance found in the randomForest() function. The varImp() function with an uppercase "I" from the caret package is a generic method for calculating variable importance for objects produced by train() and for specific model classes such as cforest. For the sake of consistency, the varImp() function with an uppercase "I" is used. Again, the variable importance calculations for all the models are then output in a data frame for comparison. To create the data frame, models without the V11 predictor are padded with a V11 predictor that has a variable importance value of NA.

Results

set.seed(624)
ctrl1 <- cforest_control(mtry = ncol(simulated) - 1)
fit1 <- cforest(y ~ ., data = simulated, controls = ctrl1)
ctrl2 <- cforest_control(mtry = ncol(duplicated) - 1)
fit2 <- cforest(y ~ ., data = duplicated, controls = ctrl2)
varimp1 <- varImp(fit1)
varimp2 <- varImp(fit2)
varimp3 <- varImp(fit1, conditional=T)
varimp4 <- varImp(fit2, conditional=T)
colnames(varimp1) <- "Simulated_CIT_SI" 
colnames(varimp2) <- "Correlated_CIT_SI"
colnames(varimp3) <- "Simulated_CIT_CI" 
colnames(varimp4) <- "Correlated_CIT_CI"
padding1 <- data.frame(Simulated_CIT_SI=NA, row.names="V11")
padding2 <- data.frame(Simulated_CIT_CI=NA, row.names="V11")
data.frame(rbind(varimp1, padding1), rbind(varimp3, padding2), varimp2, varimp4)
##     Simulated_CIT_SI Simulated_CIT_CI Correlated_CIT_SI Correlated_CIT_CI
## V1       8.463440565      4.618481133       7.892100991       3.166580468
## V2       4.771477540      2.468352173       4.776313322       2.291296308
## V3       0.205261839      0.046998983       0.186140203       0.041731395
## V4      13.017103358      7.018594423      12.956593825       7.536054721
## V5       2.128344753      1.184514596       2.085292315       0.969294330
## V6       0.092270913      0.019074037       0.037602144       0.008036876
## V7       0.013105955      0.003591459       0.002315421      -0.003410836
## V8      -0.046276478     -0.018015592      -0.042813099      -0.017941796
## V9      -0.015857479     -0.007643672       0.011124091       0.011078896
## V10     -0.004241806     -0.016000361       0.009453343       0.009261656
## V11               NA               NA       0.445232778       0.076249725

Interpretation

Conditional variable importance scores are supposed to reflect "the true impact of each predictor variable more reliably than" the standard variable importance score calculation method. That, however, does not seem to be the case here. For the Conditional Inference Trees (CIT), the conditional importance (CI) scores of the informative variables follow the same ranking pattern as the standard importance (SI) scores for both the simulated and the highly correlated data. The pattern is nearly the same for the uninformative predictors as well. In fact, the decrease in importance of the informative variables when a highly correlated variable is included is greater when using the conditional importance (CI) scores.

Exercise 8.1.d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Approach

Unlike the ipredbagg() function from the ipred package, the bagging() function from the same package conducts bagging for classification, regression, and survival trees using a formula interface (Y ~ X) instead of a non-formula interface (y=Y, x=X). The trees in the bagging() function are computed using the implementation in the rpart package. The nbagg parameter specifies the number of bootstrap replications. The gbm() function fits generalized boosted regression models. The distribution parameter, which depends on the response variable, is set to Gaussian since the response is not of the Bernoulli (2 classes), Multinomial (more than 2 classes), or Survival (right censored) types. The cubist() function from the Cubist package fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set. The committees parameter specifies how many boosting iterations should be used in the ensemble. The Random Forest implementations randomForest() and cforest() are more flexible and reliable for computing Bootstrap Aggregated trees than the bagging() function and should generally be used instead, but the bagging() function is used in this exercise for demonstration purposes. Again, the variable importance calculations for all the models are then output in a data frame for comparison. To create the data frame, models without the V11 predictor are padded with a V11 predictor that has a variable importance value of NA.

Results

set.seed(624)
fit3 <- bagging(y ~ ., simulated, nbagg = 50)
fit4 <- bagging(y ~ ., duplicated, nbagg = 50)
set.seed(624)
fit5 <- gbm(y ~ ., simulated, n.trees = 100, distribution = "gaussian")
fit6 <- gbm(y ~ ., duplicated, n.trees = 100, distribution = "gaussian")
set.seed(624)
fit7 <- cubist(x=simulated[,-11], y=simulated[,11], committees = 100)
fit8 <- cubist(x=duplicated[,-11], y=duplicated[,11], committees = 100)
varimp1 <- varImp(fit3)
varimp2 <- varImp(fit4)
varimp3 <- varImp(fit5, numTrees = 100)
varimp4 <- varImp(fit6, numTrees = 100)
varimp5 <- varImp(fit7)
varimp6 <- varImp(fit8)
colnames(varimp1) <- "Sim_BAG"; colnames(varimp2) <- "Cor_BAG"
colnames(varimp3) <- "Sim_SGB"; colnames(varimp4) <- "Cor_SGB"
colnames(varimp5) <- "Sim_CUBE"; colnames(varimp6) <- "Cor_CUBE"
padding1 <- data.frame(Sim_BAG=NA, row.names="V11")
padding2 <- data.frame(Sim_SGB=NA, row.names="V11")
padding3 <- data.frame(Sim_CUBE=NA, row.names="V11")
data.frame(rbind(varimp1, padding1),  
  rbind(varimp3, padding2),
  rbind(varimp5, padding3), 
  varimp2, varimp4, varimp6)
##       Sim_BAG    Sim_SGB Sim_CUBE   Cor_BAG    Cor_SGB Cor_CUBE
## V1  2.1613930  7873.1807     64.0 2.1429383  9506.4479     61.0
## V10 0.8968806  1629.1682     65.5 0.7543033  1599.3984     63.0
## V2  2.3825483     0.0000     46.5 1.8380012     0.0000     48.0
## V3  1.5342296 58742.6302     35.5 2.1571350 55112.2638     38.0
## V4  2.0067900   506.7549     25.5 1.3959607   472.4445      8.5
## V5  2.4033751     0.0000      2.5 2.0409922     0.0000     25.0
## V6  0.9973661     0.0000      0.5 2.3613080     0.0000      1.0
## V7  0.7475497     0.0000      4.0 0.6820392     0.0000      4.0
## V8  0.7030260     0.0000      3.0 0.6317266     0.0000      2.0
## V9  0.7401711     0.0000      2.5 0.5615145     0.0000      1.5
## V11        NA         NA       NA 0.7099651  1098.1727      0.5

Interpretation

Adding the predictor V11 that is highly correlated with the most informative predictor V1 has differing effects on the variable importance score of V1 as well as on the importance scores of all the other informative and uninformative variables. The Bootstrap Aggregation (BAG) method performs worse than the Conditional Inference Trees (CIT) from the previous exercise. Adding the highly correlated variable makes the BAG model interpret the correlated variable V11 as unimportant, but one of the uninformative variables (V6) as the most important. The Stochastic Gradient Boosting (SGB) method performs poorly as well, with one uninformative variable (V10) being listed as important and two informative variables (V2, V5) being listed as unimportant. The Rule-Based Cubist (CUBE) model fares better than the other models by not marking any informative variables as unimportant, but it also lists an uninformative variable (V10) as the most important variable. This exercise highlights a weakness in tree-based models. When variables are highly correlated, tree-based models have a tendency to select more predictors than are actually needed, with the choice between highly correlated predictors being nearly random.

Exercise 8.2

Use a simulation to show tree bias with different granularities.

Approach

Single Regression Trees suffer from selection bias: predictors with a higher number of distinct values (e.g., continuous, higher-variance predictors) are favored over more granular, lower-variance predictors with few distinct values. The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. There is then a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all. Also, as the number of missing values increases, the selection of predictors becomes more biased. This phenomenon is best exemplified by creating a low-variance, few-valued predictor that is related to the response variable and a second, higher-variance predictor that is not related to the response variable. Specifically, a predictor X1 with 100 one values and 100 two values is created. Then a response variable Y is created that is equal to X1 plus pseudo-random Gaussian noise with \(\mu=0, \sigma=4\). A completely unrelated pseudo-random Gaussian variable X2 with \(\mu=0, \sigma=2\) (which is continuous and therefore has many more distinct values than X1) is then created. These variables are then placed in a data frame and modeled using the rpart() function, which makes splits based on the CART methodology. The varImp() function then calculates the variable importance for the model.

Results

set.seed(624)
X1 <- rep(1:2, each=100)
Y <- X1 + rnorm(200, mean=0, sd=4)
set.seed(624)
X2 <- rnorm(200, mean=0, sd=2)
simData <- data.frame(Y=Y, X1=X1, X2=X2)
set.seed(624)
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
##      Overall
## X1 0.8849928
## X2 4.7209459

Interpretation

The lower variance predictor X1, which is related to Y, is found to be less important than the higher variance X2 predictor, which is completely unrelated to Y. The selection bias pitfall of Single Regression Trees is highlighted well in this example. The importance of the unrelated X2 predictor is not just greater, it is substantially greater. The calculated importance value of the unrelated X2 predictor is over five times larger than that of the related X1 predictor.

Exercise 8.3

In Stochastic Gradient Boosting the Bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. [The below generates] variable importance plots for boosting using two extreme values for the Bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data.

data(solubility)
X <- solTrainXtrans; Y <- solTrainY
indx <- createFolds(Y, returnTrain = T)
ctrl <- trainControl(method = "cv", index = indx)
set.seed(624)
tg1 <- expand.grid(interaction.depth=seq(1,7,by=2), 
  n.trees=seq(100,1000,by=100), shrinkage = c(0.01, 0.1), n.minobsinnode=10)
tune1 <- train(X, Y, method="gbm", tuneGrid=tg1, trControl=ctrl, verbose=F)
int_dpth <- tune1$bestTune$interaction.depth
ntrees <- tune1$bestTune$n.trees
set.seed(624)
tg2 <- expand.grid(interaction.depth=int_dpth,
  n.trees=ntrees, shrinkage=0.1, n.minobsinnode=10)
tune2 <- train(X, Y, method = "gbm", tuneGrid=tg2,
  trControl=ctrl, bag.fraction=0.1, verbose=F)
set.seed(624)
tg3 <- expand.grid(interaction.depth=int_dpth,
  n.trees=ntrees, shrinkage=0.9, n.minobsinnode=10)
tune3 <- train(X, Y, method = "gbm", tuneGrid=tg3, 
  trControl=ctrl, bag.fraction=0.9, verbose=F)

A comparison of variable importance magnitudes for differing values of the Bagging fraction and shrinkage parameters. The left-hand plot has both [tuning] parameters set to 0.1, and the right-hand plot has both [tuning parameters] set to 0.9:

varimp1 <- varImp(tune2, scale=F); varimp2 <- varImp(tune3, scale=F)
plot1 <- plot(varimp1, top=25, scales=list(y=list(cex=.95)))
plot2 <- plot(varimp2, top=25, scales=list(y=list(cex=.95)))
print(plot1, split=c(1,1,2,1), more=T); print(plot2, split=c(2,1,2,1))

Exercise 8.3.a

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Approach

The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. The hyperparameters declared in the tuning grid to be evaluated for the gbm (SGB) model are given in the question setup. Models with various variable interaction depths (interaction.depth), fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode) are compared. Then the best interaction depth and number of iterations (fitted trees) from the tuning are examined more closely with their own SGB models. The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion.

Results

tune1$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 75     500                 7       0.1             10
varimp1 <- varImp(tune1)$importance
varimp2 <- varImp(tune2)$importance
df <- data.frame(sort(varimp1$Overall, T), sort(varimp2$Overall, T))
colnames(df) <- c("ShrinkBag_0.1" , "ShrinkBag_0.9")
head(df, 15)
##    ShrinkBag_0.1 ShrinkBag_0.9
## 1     100.000000    100.000000
## 2      75.971280     46.330520
## 3      39.558941     45.716197
## 4      37.390117     43.425955
## 5      22.844465     36.992822
## 6      11.207339     23.950815
## 7       9.884140     22.157978
## 8       5.313661     17.196800
## 9       5.233103     15.727436
## 10      4.616990     13.632785
## 11      4.079079     13.384229
## 12      3.827388     12.159232
## 13      3.771225     10.210433
## 14      3.059019      9.302018
## 15      2.877306      8.447839

Interpretation

This question is somewhat hard to answer. According to the author, model greediness (choosing the optimal weak learner at each stage), which concentrates importance on fewer predictors, increases proportionally with the learning rate. Therefore, the lower learning rate of 0.1 in the left model spreads importance across more predictors than the model on the right, which has a 0.9 learning rate. The number of predictors used is also proportional to the stochastic nature of the model, such that increased randomness results in an increase in the number of predictors. Therefore, the lower bagging fraction of 0.1 in the left model also spreads importance across more predictors than the more deterministic model on the right, which selects 0.9 of the training set observations to propose the next tree in the expansion.

Exercise 8.3.b

Which model do you think would be more predictive of other samples?

Approach

The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion. The bagging fraction introduces randomness into the model that tends to improve its predictive ability.

Results

data.frame(model="SGB_0.1", tune2$bestTune, RMSE=min(tune2$results$RMSE), row.names="")
##    model n.trees interaction.depth shrinkage n.minobsinnode      RMSE
##  SGB_0.1     500                 7       0.1             10 0.6524385
data.frame(model="SGB_0.9", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
##    model n.trees interaction.depth shrinkage n.minobsinnode      RMSE
##  SGB_0.9     500                 7       0.9             10 0.8252062

Interpretation

Greedy models are less likely to select the optimal global model and are prone to over-fitting. Stochastic models reduce prediction variance. Therefore, the less greedy model on the left (with a 0.1 learning rate) that is also more random (due to only selecting 0.1 of the training set observations to propose the next tree in the expansion) would be more predictive of other samples. The RMSE metric for the respective models supports this conclusion.

Exercise 8.3.c

How would increasing interaction depth affect the slope of predictor importance for either model?

Approach

The first SGB model uses a 0.1 learning rate and selects 0.1 of the training set observations (bag.fraction) to propose the next tree in the expansion. The second SGB model uses a 0.9 learning rate and selects 0.9 of the training set observations (bag.fraction) to propose the next tree in the expansion. The learning rate (shrinkage) scales the contribution that each new tree adds to the ensemble, while the interaction depth controls how deep, and therefore across how many predictors, each individual tree can split.

Results

plot(tune1)

Interpretation

The impact of interaction depth on RMSE depends on the learning rate and the number of trees. In these data, the higher learning rate shows improvements only up to roughly the 300-tree mark, and models with the lower learning rate see greater improvements from interaction depth than models with the higher learning rate. With respect to predictor importance, increasing the interaction depth allows each tree to split on more predictors, so importance is spread across a larger number of predictors in either model. This flattens the slope of the predictor importance plot: the top predictors account for a smaller share of the total importance, and the remainder is distributed over more variables.

Exercise 8.4

Use a single predictor in the solubility data, such as the molecular weight or the number of carbon atoms and fit several models:

data(solubility)
SOL_train_X <- subset(solTrainXtrans, select="MolWeight")
SOL_train_Y <- solTrainY
SOL_test_X <- subset(solTestXtrans, select="MolWeight")
SOL_test_Y <- solTestY

Exercise 8.4.a

A simple regression tree.

Approach

The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset where the MolWeight field is selected. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. CART can be implemented in the train() function using the rpart and the rpart2 methods. The rpart method tunes the model over the complexity parameter (right-sized tree based on RMSE). The rpart2 method tunes the model over the maximum depth (of the tree).

Results

set.seed(624)
tune1 <- train(SOL_train_X, SOL_train_Y, method = "rpart2", tuneLength = 1)
min(tune1$results$RMSE)
## [1] 1.663586

Interpretation

Fitting a CART model between the Molecular Weight predictor and the solubility dataset target variable results in a 1.663586 RMSE.

Exercise 8.4.b

A random forest model.

Approach

The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset selecting the MolWeight field. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The rf method in the train() function implements Random Forest classification and regression using the randomForest package.

Results

set.seed(624)
tune2 <- train(SOL_train_X, SOL_train_Y, method = "rf", tuneLength = 1)
min(tune2$results$RMSE)
## [1] 1.562868

Interpretation

Fitting a RF model between the Molecular Weight predictor and the solubility dataset target variable results in a 1.562868 RMSE.

Exercise 8.4.c

Different Cubist models with a single rule or multiple committees (each with and without using neighbor adjustments).

Approach

The training and test sets are predefined by the solubility data. Selecting the molecular weight subset involves a simple application of the base R subset() function on the dataset selecting the MolWeight field. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. The cubist method in the train() function implements a rule-based M5 model with additional corrections based on nearest neighbors in the training set using the Cubist package. The hyperparameters declared in the tuning grid to be evaluated are the committees parameter, which specifies how many boosting iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of instances used to correct the rule-based prediction.

Results

set.seed(624)
tg3 <- expand.grid(committees = 1, neighbors = 0)
tune3 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg3)
min(tune3$results$RMSE)
## [1] 1.551064
set.seed(624)
tg4 <- expand.grid(committees = 1, neighbors = c(1,3,5,7))
tune4 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg4)
min(tune4$results$RMSE)
## [1] 1.619041
set.seed(624)
tg5 <- expand.grid(committees = 100, neighbors = 0)
tune5 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg5)
min(tune5$results$RMSE)
## [1] 1.5287
set.seed(624)
tg6 <- expand.grid(committees = 100, neighbors = c(1,3,5,7))
tune6 <- train(SOL_train_X, SOL_train_Y, method="cubist", verbose=F, metric="Rsquared", tuneGrid=tg6)
min(tune6$results$RMSE)
## [1] 1.602018

Interpretation

Fitting CUBE models between the Molecular Weight predictor and the solubility dataset target variable results in RMSE values ranging between 1.528700 and 1.619041.

Exercise 8.4.d

Plot the predictor data versus the solubility results for the test set. Overlay the model predictions for the test set. How do the models differ? Does changing the tuning parameter(s) significantly affect the model fit?

Approach

The predict() function applies a fitted model to data consisting of predictor variables. A function is created to plot the predictor test data versus the solubility test results then overlay the predictions of the six models. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics.

Results

models <- c("CART","RF","CUBE_1.0","CUBE_1.n","CUBE_C.0","CUBE_C.n")
fcast1 <- predict(tune1, newdata = SOL_test_X)
fcast2 <- predict(tune2, newdata = SOL_test_X)
fcast3 <- predict(tune3, newdata = SOL_test_X)
fcast4 <- predict(tune4, newdata = SOL_test_X)
fcast5 <- predict(tune5, newdata = SOL_test_X)
fcast6 <- predict(tune6, newdata = SOL_test_X)
fcast <- cbind(fcast1, fcast2, fcast3, fcast4, fcast5, fcast6)
par(mfrow=c(3, 2), mar = c(0, 1, 2, 0), oma = c(0, 0, 0.5, 0.5))
for (i in 1:6) {
  X <- SOL_test_X$MolWeight; Y <- SOL_test_Y
  plot(X, Y, xaxt="n", yaxt="n", xlab="", ylab="", main=models[i])
  par(new=T); 
  plot(X, fcast[,i], xaxt="n", yaxt="n", xlab="", ylab="", col="blue")
}

data.frame(row.names = models, rbind(
  postResample(pred = fcast1, obs = SOL_test_Y), 
  postResample(pred = fcast2, obs = SOL_test_Y), 
  postResample(pred = fcast3, obs = SOL_test_Y),
  postResample(pred = fcast4, obs = SOL_test_Y),
  postResample(pred = fcast5, obs = SOL_test_Y),
  postResample(pred = fcast6, obs = SOL_test_Y)))
##              RMSE  Rsquared       MAE
## CART     1.712811 0.3219683 1.3079220
## RF       1.333482 0.5921083 0.9325201
## CUBE_1.0 1.543414 0.4862360 1.0864710
## CUBE_1.n 1.585349 0.4235102 1.1716094
## CUBE_C.0 1.540014 0.4881868 1.0856421
## CUBE_C.n 1.586609 0.4223147 1.1720924

Interpretation

Overlaying the model predictions on the test set is very useful in highlighting the differences in the models. The CART model predicts values within only two levels, which does not fit the data very well. The RF model fits the data much better, but its predictions are the most heteroskedastic of all the fitted models. The CUBE models appear to perform the best, but there are differences between them. Although the number of committees (which specifies how many boosting iterations should be used in the ensemble) has little to no effect on the CUBE model predictions, the number of neighbors (which specifies the number of instances used to correct the rule-based prediction) has a significant impact. Declaring zero neighbors makes the CUBE model predictions homoskedastic to a fault and returns a hockey-stick shaped prediction reminiscent of nonlinear MARS models. Declaring multiple neighbors adds variance to the prediction, which improves the fit. Interestingly enough, the RMSE, \(R^2\), and MAE performance metrics indicate that the RF model with the most heteroskedasticity is the best fit for these data.

Exercise 8.5

Fit different tree- and rule-based models for the Tecator data discussed [below]. How do they compare to linear models? Do the between-predictor correlations seem to affect your models? If so, how would you transform or re-encode the predictor data to mitigate this issue?

Infrared (IR) spectroscopy technology is used to determine the chemical makeup of a substance. The theory of IR spectroscopy holds that unique molecular structures absorb IR frequencies differently. In practice a spectrometer fires a series of IR frequencies into a sample material, and the device measures the absorbance of the sample at each individual frequency. This series of measurements creates a spectrum profile which can then be used to determine the chemical makeup of the sample material. A Tecator Infratec Food and Feed Analyzer instrument was used to analyze 215 samples of meat across 100 frequencies. In addition to an IR profile, analytical chemistry determined the percent content of water, fat, and protein for each sample. If we can establish a predictive relationship between IR spectrum and fat content, then food scientists could predict a sample's fat content with IR instead of using analytical chemistry. This would provide cost savings, since analytical chemistry is a more expensive, time-consuming process.

data(tecator)
set.seed(624)
rows_train <- createDataPartition(endpoints[, 3], p = 3/4, list= FALSE)
TEC_train_X <- as.data.frame(absorp[rows_train, ])
TEC_test_X <- as.data.frame(absorp[-rows_train, ])
TEC_train_Y <- endpoints[rows_train, 3]
TEC_test_Y <- endpoints[-rows_train, 3]
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
set.seed(624)
tune1 <- train(x = TEC_train_X, y = TEC_train_Y, method="pls",
  trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 25)

Approach

The tecator data partitioning, controls, and Partial Least Squares (PLS) model are given in the question setup. The createDataPartition() function helps create a series of test and training partitions. The p parameter sets the percentage of data that goes to training. Declaring list=FALSE returns a matrix. The trainControl() function controls the computational nuances of the train() function. The repeatedcv resampling method and the repeats parameter specify that repeated \(k\)-fold cross-validation is to be implemented. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. CART is implemented in the train() function using the rpart method, which tunes the model over the complexity parameter (right-sized tree based on RMSE). BAG is implemented in the train() function using the treebag method, which conducts bagging for classification, regression, and survival trees using the implementation in the rpart package. RF is implemented in the train() function using the rf method, which implements Random Forest classification and regression using the randomForest package. SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The hyperparameters declared in the tuning grid to be evaluated are various variable interaction depths (interaction.depth), fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode). CUBE is implemented in the train() function using the cubist method, which fits a rule-based M5 model with additional corrections based on nearest neighbors in the training set. The hyperparameters declared in the tuning grid to be evaluated are the committees parameter, which specifies how many boosting iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of instances used to correct the rule-based prediction. A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics.

Results

set.seed(624)
tune2 <- train(x = TEC_train_X, y = TEC_train_Y, 
  method="rpart", trControl = ctrl, tuneLength = 25)
set.seed(624)
tune3<- train(x = TEC_train_X, y = TEC_train_Y, 
  method="treebag", trControl = ctrl)
set.seed(624)
tune4<- train(x = TEC_train_X, y = TEC_train_Y, method = "rf",
  ntree = 1500, tuneLength = 10, trControl = ctrl)
set.seed(624)
tg5 <- expand.grid(interaction.depth = seq(1, 7, by = 2),
  n.trees = seq(100, 1000, by = 50), shrinkage = c(0.01, 0.1), n.minobsinnode=10)
tune5 <- train(x = TEC_train_X, y = TEC_train_Y, method = "gbm",
  verbose = FALSE, tuneGrid = tg5, trControl = ctrl)
set.seed(624)
tg6 <- expand.grid(committees = c(1:10, 20, 50, 75, 100), neighbors = c(0, 1, 5, 9))
tune6 <- train(x = TEC_train_X, y = TEC_train_Y, method = "cubist",
  verbose = FALSE, tuneGrid = tg6, trControl = ctrl)
Training Set Resampling
metrics <- function(tune) {
  RMSE = min(tune$results$RMSE)
  Rsquared = max(tune$results$Rsquared)
  MAE = min(tune$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2), metrics(tune3), 
  metrics(tune4), metrics(tune5), metrics(tune6)), 
  row.names = c("PLS","CART","BAG","RF","SGB","CUBE"))
##           RMSE  Rsquared       MAE
## PLS  0.5641773 0.9667829 0.4537333
## CART 2.6292768 0.3216989 2.0270089
## BAG  2.3087450 0.4360403 1.8714800
## RF   2.1175918 0.5259554 1.6867034
## SGB  1.9179625 0.6075514 1.4417302
## CUBE 0.5795218 0.9644043 0.4416561
fits <- list(PLS=tune1, CART=tune2, BAG=tune3, RF=tune4, SGB=tune5, CUBE=tune6)
bwplot(resamples(fits))

Interpretation

After fitting a CART, BAG, RF, SGB, and CUBE model to the tecator data, which were previously found to be best suited for fitting with a linear PLS model, it appears that a CUBE model is also well suited for modeling the tecator data. The CUBE and PLS models outperform all of the other tree-based models in every resampling performance metric. The RMSE and \(R^2\) resampling performance metrics of the PLS model are superior to those of the CUBE model, however. The CUBE model only outperforms the PLS model in the MAE resampling performance metric. Based on these performance metrics, the linear PLS model still appears to provide the best fit for the tecator data (although the CUBE model comes in at a close second). In Exercise 8.1 it was found that predictor correlations do have an impact on CUBE models, so removing highly correlated variables using findCorrelation(cor(.)) may yield improvements that could make a CUBE model a better fit for the tecator data than a PLS model.
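
A hedged sketch of that suggested follow-up is shown below; it reuses the TEC_train_X, TEC_train_Y, and ctrl objects defined above, and the 0.99 cutoff and the reduced tuning grid are illustrative choices rather than tuned values:

# Drop highly correlated absorbance columns, then refit the Cubist model for comparison
high_cor <- findCorrelation(cor(TEC_train_X), cutoff = 0.99)
TEC_train_Xf <- if (length(high_cor) > 0) TEC_train_X[, -high_cor] else TEC_train_X
set.seed(624)
tg7 <- expand.grid(committees = c(1, 10, 50), neighbors = c(0, 5))
tune7 <- train(x = TEC_train_Xf, y = TEC_train_Y, method = "cubist",
  verbose = FALSE, tuneGrid = tg7, trControl = ctrl)
min(tune7$results$RMSE)                     # compare against the CUBE row above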

Exercise 8.6

Return to the permeability problem described [below]. Train several tree-based models and evaluate the resampling and test set performance.

Permeability is the measure of a molecule's ability to cross a membrane. Compounds that appear to be effective for a particular disease in research screening experiments but appear to be poorly permeable may need to be altered in order to improve permeability and thus the compound's ability to reach the desired target. Permeability assays such as PAMPA and Caco-2 have been developed to help measure compounds' permeability. These screens are effective at quantifying a compound's permeability, but the assays are expensive and labor intensive. Given a sufficient number of compounds that have been screened, we could develop a predictive model for permeability in an attempt to potentially reduce the need for the assay. In this project there were 165 unique compounds; 1,107 molecular fingerprints were determined for each. A molecular fingerprint is a binary sequence of numbers that represents the presence or absence of a specific molecular substructure. The response is highly skewed, the predictors are sparse, and many predictors are strongly associated. Developing a model to predict permeability could save significant resources for a pharmaceutical company, while at the same time more rapidly identifying molecules that have a sufficient permeability to become a drug.

library(AppliedPredictiveModeling)
data(permeability)
nzv <- nearZeroVar(fingerprints)
fingerprints <- fingerprints[,-nzv]
set.seed(624)
rows_train <- createDataPartition(permeability, p = 0.75, list = FALSE)
PER_train_X <- fingerprints[rows_train,]
PER_train_Y <- permeability[rows_train,]
PER_test_X <- fingerprints[-rows_train,]
PER_test_Y <- permeability[-rows_train,]
ctrl <- trainControl(method = "LGOCV")
set.seed(624)
tg1 <- expand.grid(lambda = c(0, 0.05, .1), fraction = seq(.05, 1, length = 25))
tune1 <- train(x = PER_train_X, y = log10(PER_train_Y), 
  method = "enet", tuneGrid = tg1, trControl = ctrl)
set.seed(624)
tg2 <- expand.grid(sigma = c(0.0005,0.001,0.0015), C = seq(1,49,by=6))
tune2 <- train(x = PER_train_X, y = log10(PER_train_Y), 
  method = "svmRadial", tuneGrid = tg2, trControl = ctrl)

Exercise 8.6.a

Which tree-based model gives the optimal resampling and test set performance?

Approach

The permeability data partitioning, transformation, controls, Elastic Net Regression (ENET) model, and Support Vector Machines (SVM) model are given in the question setup. The createDataPartition() function helps create a series of test and training partitions. The p parameter sets the percentage of data that goes to training. Declaring list=FALSE returns a matrix. The trainControl() function controls the computational nuances of the train() function. The LGOCV resampling method specifies that leave-group-out cross-validation is to be implemented. The train() function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling based performance measure. The tuneGrid parameter of the train() function is fed a data frame with values in columns named after the parameters being tuned. CART is implemented in the train() function using the rpart2 method, which tunes the model over the maximum depth (of the tree). RF is implemented in the train() function using the rf method, which implements Random Forest classification and regression using the randomForest package. SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The hyperparameters declared in the tuning grid to be evaluated are various variable interaction depths (interaction.depth), fitted trees (n.trees), learning rates (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode). A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics. The predict() function applies a fitted model to data consisting of predictor variables. A data frame with the test set performance metrics is then displayed.

Results

set.seed(624)
tg3 <- expand.grid(maxdepth= seq(1,10,by=1))
tune3 <- train(x = PER_train_X, y = log10(PER_train_Y),
  method = "rpart2", tuneGrid = tg3, trControl = ctrl)
set.seed(624)
tune4 <- train(x = PER_train_X, y = log10(PER_train_Y),
  method = "rf", tuneLength = 10, importance = TRUE, trControl = ctrl)
set.seed(624)
tg5 <- expand.grid(interaction.depth=seq(1,6,by=1), n.minobsinnode=10, 
  n.trees=c(25,50,100,200), shrinkage=c(0.01,0.05,0.1))
tune5 <- train(x = PER_train_X, y = log10(PER_train_Y),
  method = "gbm", verbose = F, tuneGrid = tg5, trControl = ctrl)
Training Set Resampling
metrics <- function(tune) {
  RMSE = min(tune$results$RMSE)
  Rsquared = max(tune$results$Rsquared)
  MAE = min(tune$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune3), metrics(tune4),
  metrics(tune5)), row.names = c("CART","RF","SGB"))
##           RMSE  Rsquared       MAE
## CART 0.5154786 0.4591854 0.3802313
## RF   0.4869323 0.5249267 0.3657372
## SGB  0.4701520 0.5548857 0.3613686
fits <- list(CART=tune3, RF=tune4, SGB=tune5)
bwplot(resamples(fits))

Validation on Test Set
fcast3 <- predict(tune3, newdata = PER_test_X)
fcast4 <- predict(tune4, newdata = PER_test_X)
fcast5 <- predict(tune5, newdata = PER_test_X)
data.frame(rbind(postResample(pred = fcast3, obs = PER_test_Y),
  postResample(pred = fcast4, obs = PER_test_Y),
  postResample(pred = fcast5, obs = PER_test_Y)), 
  row.names = c("CART","RF","SGB"))
##          RMSE  Rsquared      MAE
## CART 18.75267 0.3466993 11.66874
## RF   18.69633 0.4892234 11.60125
## SGB  18.71166 0.4054850 11.59132

Interpretation

After fitting tree-based CART, RF, and SGB models to the permeability data, which were previously found to be best suited for fitting with a linear ENET or a nonlinear SVM model, it appears that the most suitable tree-based model for the permeability data is the SGB model. The SGB model outperforms the other tree-based models in every resampling performance metric. The RF model does outperform the other tree-based models on the test set, but these test set metrics are less important to predictive ability than the resampling performance metrics.

Exercise 8.6.b

Do any of these models outperform the covariance or non-covariance based regression models you have previously developed for these data? What criteria did you use to compare models’ performance?

Approach

A function is created to help display the resampling performance metrics as a data frame. The bwplot() function plots a series of vertical box-and-whisker plots for the resampling performance metrics. A data frame with the test set performance metrics is also displayed.

Results

metrics <- function(tune) {
  RMSE = min(tune$results$RMSE)
  Rsquared = max(tune$results$Rsquared)
  MAE = min(tune$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2), 
  metrics(tune3), metrics(tune4), metrics(tune5)),
  row.names = c("ENET","SVM","CART","RF","SGB"))
##           RMSE  Rsquared       MAE
## ENET 0.4673019 0.5684217 0.3533567
## SVM  0.5118277 0.4741713 0.3859506
## CART 0.5154786 0.4591854 0.3802313
## RF   0.4869323 0.5249267 0.3657372
## SGB  0.4701520 0.5548857 0.3613686
fits <- list(ENET=tune1, SVM=tune2, CART=tune3, RF=tune4, SGB=tune5)
bwplot(resamples(fits))

fcast1 <- predict(tune1, newdata = PER_test_X)
fcast2 <- predict(tune2, newdata = PER_test_X)
fcast3 <- predict(tune3, newdata = PER_test_X)
fcast4 <- predict(tune4, newdata = PER_test_X)
fcast5 <- predict(tune5, newdata = PER_test_X)
data.frame(rbind(postResample(pred = fcast1, obs = PER_test_Y), 
  postResample(pred = fcast2, obs = PER_test_Y), 
  postResample(pred = fcast3, obs = PER_test_Y),
  postResample(pred = fcast4, obs = PER_test_Y),
  postResample(pred = fcast5, obs = PER_test_Y)), 
  row.names = c("ENET","SVM","CART","RF","SGB"))
##          RMSE  Rsquared      MAE
## ENET 18.71909 0.4051683 11.56402
## SVM  18.65525 0.3940251 11.50458
## CART 18.75267 0.3466993 11.66874
## RF   18.69633 0.4892234 11.60125
## SGB  18.71166 0.4054850 11.59132

Interpretation

After fitting tree-based CART, RF, and SGB models to the permeability data, which were previously found to be best fit by a covariance-based (linear) ENET or non-covariance-based (nonlinear) SVM model, it appears that the most suitable model for the permeability data is still the ENET model. The ENET model outperforms the SVM and tree-based models on every resampling performance metric. Other models outperform the ENET model on the test set, but the test set metrics carry less weight in judging predictive ability than the resampling performance metrics.

Exercise 8.6.c

Of all the models you have developed thus far, which, if any, would you recommend to replace the permeability laboratory experiment?

Approach

The object returned by the train() function contains several attributes that help evaluate model performance.

Results

tune1$modelInfo$label
data.frame(model="ENET", tune1$bestTune, RMSE=min(tune1$results$RMSE), row.names="")
plot(tune1)

## [1] "Elasticnet"
##  model  fraction lambda      RMSE
##   ENET 0.1291667    0.1 0.4673019
tune3$modelInfo$label
data.frame(model="CART", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
plot(tune3)

## [1] "CART"
##  model maxdepth      RMSE
##   CART        3 0.5154786

Interpretation

Using the resampling performance metrics as a guide in recommending a replacement for the permeability laboratory experiment, it would be advisable to opt for the ENET model since it performs better than the SVM and tree-based models. If, however, a more interpretable model is necessary, the CART model would be a good choice. Its resampling performance is relatively close to that of the ENET model, and it is easier for non-technical audiences to interpret.
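If the more interpretable CART model were chosen, its final tree could be inspected directly. A minimal sketch, assuming tune3 is the tuned CART fit for the permeability data from above, using as.party() from partykit and gpar() from grid:

plot(as.party(tune3$finalModel), gp = gpar(fontsize = 8))  # display the selected CART tree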

Exercise 8.7

Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

CMP <- get(data(ChemicalManufacturingProcess))
rm(ChemicalManufacturingProcess) # list not needed
set.seed(624)
rows_train <- createDataPartition(CMP$Yield, p=0.75, list=F)
CMP_train <- CMP[rows_train, ]
CMP_test <- CMP[-rows_train, ]
prepro <- preProcess(subset(CMP_train, select = - Yield),
  method=c("BoxCox", "center", "scale", "knnImpute"))
CMP_train_X <- predict(prepro, CMP_train[ , -1])
nzv <- nearZeroVar(CMP_train_X)
mcl <- findCorrelation(cor(CMP_train_X))
CMP_train_X <- CMP_train_X[ ,-c(nzv, mcl)] 
set.seed(624)
ctrl1 <- trainControl(method = "boot", number = 25)
tune1 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method="pls", tuneLength=15, trControl=ctrl1)
set.seed(624)
ctrl2 <- trainControl(method = "cv")
tg2 <- data.frame(.k = 1:20)
tune2 <- train(x = CMP_train_X, y = CMP_train$Yield,
  method = "knn", tuneGrid = tg2, trControl = ctrl2)

Classification and Regression Tree (CART)

CART is implemented in the train() function using the rpart2 method, which tunes the model over the maximum depth of the tree (maxdepth).

set.seed(624)
ctrl3 <- trainControl(method = "boot", number = 25)
tg3 <- expand.grid(maxdepth= seq(1,10,by=1))
tune3 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rpart2",
  metric = "Rsquared", tuneGrid = tg3, trControl = ctrl3)

Conditional Inference Tree (CIT)

CIT is implemented in the train() function using the ctree2 method, which tunes the model over the maximum depth of the tree (maxdepth) and the criterion that must be exceeded in order to implement a split (mincriterion, specified here as 1 minus the significance level of the hypothesis test).

set.seed(624)
ctrl4 <- trainControl(method = "cv")
tg4 <- expand.grid(maxdepth= seq(1,10,by=1), mincriterion=1-c(0.01, 0.05, 0.1))
tune4 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "ctree2",
  metric = "Rsquared", tuneGrid = tg4, trControl = ctrl4)

Regression Model Tree (M5)

M5 is implemented in the train() function using the M5 method from the RWeka package. The pruned, smoothed, and rules hyperparameters evaluate models with and without pruning, smoothing, and a rule-based representation.

set.seed(624)
yn <- c("Yes","No")
ctrl5 <- trainControl(method = "cv")
tg5 <- expand.grid(pruned = yn, smoothed = yn, rules = yn)
tune5 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "M5",
  metric = "Rsquared", tuneGrid = tg5, trControl = ctrl5)

Bootstrap Aggregation (BAG)

BAG is implemented in the train() function using the bag method, which conducts bagging for classification and regression using the implementation in the caret package. The bagControl() function controls the computational nuances of this train() method: its fit, predict, and aggregate arguments each take a function that provides the framework for bagging classification or regression models, and here the tree-based ctreeBag functions are used. The hyperparameter declared in the tuning grid is vars, the number of randomly selected predictors made available to each bagged model.

n <- ncol(CMP_train_X)  # number of predictors remaining after filtering
seq_len(abs(n))[n %% seq_len(abs(n)) == 0L]  # divisors of n
##  [1]  1  2  3  4  6  8 12 16 24 48
set.seed(624)
bag6 <- bagControl(fit=ctreeBag$fit, predict=ctreeBag$pred, aggregate=ctreeBag$aggregate)
ctrl6 <-trainControl(method = "cv")
tg6 <- data.frame(vars = seq(2, n, 11))
tune6 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "bag", B = 25, 
  metric = "Rsquared", tuneGrid = tg6, trControl = ctrl6, bagControl = bag6)

Random Forest (RF)

RF is implemented in the train() function using the rf method, which implements Random Forest classification and regression via the randomForest package, tuning over the number of randomly selected predictors available at each split (mtry).

set.seed(624)
ctrl7 <- trainControl(method = "boot", number = 25)
tg7 <- expand.grid(mtry=seq(2,38,by=3))
tune7 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "rf", 
  metric = "Rsquared", tuneGrid = tg7, trControl = ctrl7)

Stochastic Gradient Boosting (SGB)

SGB is implemented in the train() function using the gbm method, which fits generalized boosted regression models. The hyperparameters declared in the tuning grid to be evaluated are the variable interaction depth (interaction.depth), the number of fitted trees (n.trees), the learning rate (shrinkage), and a minimum of 10 observations in the trees' terminal nodes (n.minobsinnode).

set.seed(624)
ctrl8 <- trainControl(method = "boot", number = 25)
tg8 <- expand.grid(interaction.depth=seq(1,6,by=1), n.trees=c(25,50,100,200),
  shrinkage=c(0.01,0.05,0.1,0.2), n.minobsinnode=10)
tune8 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "gbm", 
  metric = "Rsquared", tuneGrid = tg8, trControl = ctrl8, verbose=F)

Rule-Based Cubist (CUBE)

CUBE is implemented in the train() function using the cubist method, which fits a rule-based model similar to M5 with additional corrections based on nearest neighbors in the training set. The hyperparameters declared in the tuning grid to be evaluated are the committees parameter, which specifies how many boosting-like iterations should be used in the ensemble, and the neighbors parameter, which specifies the number of training set instances used to correct the rule-based prediction.

set.seed(624)
ctrl9 <- trainControl(method = "boot", number = 25)
tg9 <- expand.grid(committees = c(1,5,10,20,50,100), neighbors = c(0,1,3,5,7))
tune9 <- train(x = CMP_train_X, y = CMP_train$Yield, method = "cubist", 
  metric = "Rsquared", tuneGrid = tg9, trControl = ctrl9)

Exercise 8.7.a

Which tree-based regression model gives the optimal resampling and test set performance?

Approach

The ChemicalManufacturingProcess data partitioning, pre-processing, controls, PLS model, and KNN model are given in the question setup. The object returned by the train() function contains several attributes that help evaluate model performance. A function is created to help display the resampling performance metrics as a data frame. A slight issue arises during forecasting because a value of \(-\infty\) remains in the pre-processed test set; this infinite value hinders forecasting with some of the models, so for the sake of completeness in model comparison it is replaced with the minimum finite value of the variable to which it belongs. The predict() function applies a fitted model to data consisting of predictor variables. The postResample() function takes two vectors as its inputs and computes the RMSE, \(R^2\), and MAE performance metrics. A data frame with the test set performance metrics is then displayed.

Results

tune3$modelInfo$label
data.frame(model="CART", tune3$bestTune, RMSE=min(tune3$results$RMSE), row.names="")
plot(tune3)

## [1] "CART"
##  model maxdepth   RMSE
##   CART        8 1.5226
tune4$modelInfo$label
data.frame(model="CIT", tune4$bestTune, RMSE=min(tune4$results$RMSE), row.names="")
plot(tune4)

## [1] "Conditional Inference Tree"
##  model maxdepth mincriterion     RMSE
##    CIT        1          0.9 1.420923
tune5$modelInfo$label
data.frame(model="M5", tune5$bestTune, RMSE=min(tune5$results$RMSE), row.names="")
plot(tune5)

## [1] "Model Tree"
##  model pruned smoothed rules    RMSE
##     M5     No      Yes   Yes 1.44276
tune6$modelInfo$label
data.frame(model="BAG", tune5$bestTune, RMSE=min(tune6$results$RMSE), row.names="")
plot(tune6)

## [1] "Bagged Model"
##  model     RMSE
##    BAG 1.266828
tune7$modelInfo$label
data.frame(model="RF", tune7$bestTune, RMSE=min(tune7$results$RMSE), row.names="")
plot(tune7)

## [1] "Random Forest"
##  model mtry     RMSE
##     RF   11 1.249657
tune8$modelInfo$label
data.frame(model="SGB", tune8$bestTune, RMSE=min(tune8$results$RMSE), row.names="")
plot(tune8)

## [1] "Stochastic Gradient Boosting"
##  model n.trees interaction.depth shrinkage n.minobsinnode     RMSE
##    SGB     200                 6      0.05             10 1.263188
tune9$modelInfo$label
data.frame(model="CUBE", tune9$bestTune, RMSE=min(tune9$results$RMSE), row.names="")
plot(tune9)

## [1] "Cubist"
##  model committees neighbors     RMSE
##   CUBE        100         1 1.125089
Training Set Resampling
metrics <- function(tune) {
  RMSE = min(tune$results$RMSE)
  Rsquared = max(tune$results$Rsquared)
  MAE = min(tune$results$MAE)
  return(cbind(RMSE, Rsquared, MAE)) }
data.frame(rbind(metrics(tune1), metrics(tune2),
  metrics(tune3), metrics(tune4), metrics(tune5),
  metrics(tune6), metrics(tune7), metrics(tune8), metrics(tune9)),
  row.names = c("PLS","KNN","CART","CIT","M5","BAG","RF","SGB","CUBE"))
##          RMSE  Rsquared       MAE
## PLS  1.276354 0.5037055 1.0554596
## KNN  1.263295 0.5285458 1.0324313
## CART 1.522600 0.3430106 1.1958832
## CIT  1.420923 0.4054332 1.1562190
## M5   1.442760 0.4578988 1.1418822
## BAG  1.266828 0.5100838 1.0055814
## RF   1.249657 0.5314406 0.9782162
## SGB  1.263188 0.5076208 0.9857264
## CUBE 1.125089 0.6074287 0.8624796
Validation on Test Set
CMP_test_X <- predict(prepro, CMP_test[ , -1])
CMP_test_X <- CMP_test_X[ ,-c(nzv, mcl)] 
val <- min(CMP_test_X[is.finite(CMP_test_X[,35]), 35])  # smallest finite value in column 35
CMP_test_X[CMP_test_X[,35] == -Inf, 35] <- val          # replace the -Inf left by pre-processing
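The two lines above target column 35 directly. As a more general sketch (the helper name replace_nonfinite is hypothetical and not part of any package), every non-finite value in the pre-processed test set could be replaced with its column's minimum finite value:

replace_nonfinite <- function(df) {
  for (j in seq_along(df)) {
    if (!is.numeric(df[[j]])) next              # only numeric columns can hold -Inf
    bad <- !is.finite(df[[j]])
    if (any(bad)) df[[j]][bad] <- min(df[[j]][is.finite(df[[j]])])
  }
  df
}
# CMP_test_X <- replace_nonfinite(CMP_test_X)  # equivalent to the column-35 fix above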
fcast1 <- predict(tune1, newdata = CMP_test_X)
fcast2 <- predict(tune2, newdata = CMP_test_X)
fcast3 <- predict(tune3, newdata = CMP_test_X)
fcast4 <- predict(tune4, newdata = CMP_test_X)
fcast5 <- predict(tune5, newdata = CMP_test_X)
fcast6 <- predict(tune6, newdata = CMP_test_X)
fcast7 <- predict(tune7, newdata = CMP_test_X)
fcast8 <- predict(tune8, newdata = CMP_test_X)
fcast9 <- predict(tune9, newdata = CMP_test_X)
data.frame(rbind(postResample(pred = fcast1, obs = CMP_test$Yield), 
  postResample(pred = fcast2, obs = CMP_test$Yield), 
  postResample(pred = fcast3, obs = CMP_test$Yield),
  postResample(pred = fcast4, obs = CMP_test$Yield),
  postResample(pred = fcast5, obs = CMP_test$Yield),
  postResample(pred = fcast6, obs = CMP_test$Yield),
  postResample(pred = fcast7, obs = CMP_test$Yield),
  postResample(pred = fcast8, obs = CMP_test$Yield),
  postResample(pred = fcast9, obs = CMP_test$Yield)), 
  row.names =  c("PLS","KNN","CART","CIT","M5","BAG","RF","SGB","CUBE"))
##          RMSE  Rsquared       MAE
## PLS  1.761001 0.3990817 1.1104420
## KNN  1.368251 0.5493334 1.0565909
## CART 1.749335 0.2936150 1.3255474
## CIT  1.585968 0.3970426 1.2146354
## M5   2.050190 0.2781859 1.3634098
## BAG  1.357619 0.5735154 1.0251716
## RF   1.372159 0.5694778 0.9933338
## SGB  1.189012 0.6715511 0.8399855
## CUBE 1.603838 0.5225231 0.8327512

Interpretation

After fitting tree-based CART, CIT, M5, BAG, RF, SGB, and CUBE models to the ChemicalManufacturingProcess data, which were previously found to be best fit by a linear PLS or nonlinear KNN model, it appears that the most suitable model for these data is the CUBE model. The CUBE model with 100 committees and 1 neighbor outperforms the other models on every resampling performance metric. Several models outperform the CUBE model on the test set, with the SGB model performing best there, but the test set metrics carry less weight in judging predictive ability than the resampling performance metrics.

Exercise 8.7.b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Approach

The dotPlot() function creates a dot plot of variable importance values. The varImp() function calculates the variable importance for a given model.

Results

dotPlot(varImp(tune9), top=15)

varimp1 <- varImp(tune9)$importance
varimp2 <- varImp(tune1)$importance
varimp3 <- varImp(tune2)$importance
colnames(varimp1) <- "CUBE"
colnames(varimp2) <- "PLS"
colnames(varimp3) <- "KNN"
imp <- data.frame(varimp1, varimp2, varimp3)
head(imp[order(imp$CUBE, decreasing = T), ], 10)
##                        CUBE      PLS      KNN
## ManufacturingProcess32  100 50.30322 35.84409
## BiologicalMaterial03     50 54.80568 49.52172
## BiologicalMaterial06     50 52.82129 42.61298
## ManufacturingProcess09   50 25.67340 14.77863
## ManufacturingProcess25   50 61.51722 74.50577
## ManufacturingProcess29   50 52.70012 27.90151
## BiologicalMaterial01      0 18.94683 17.86572
## BiologicalMaterial04      0 42.67018 16.62820
## BiologicalMaterial05      0 47.86206 30.32483
## BiologicalMaterial08      0 17.20016 11.17837

Interpretation

The CUBE model, which was found to be the optimal tree-based regression model, identifies the ManufacturingProcess32 process variable as the most important predictor by a wide margin; that predictor also carries moderate importance in the PLS (optimal linear) and KNN (optimal nonlinear) models. The CUBE model assigns equal, second-tier importance to BiologicalMaterial03, BiologicalMaterial06, ManufacturingProcess09, ManufacturingProcess25, and ManufacturingProcess29, whereas the PLS and KNN models spread their importance more evenly and rank ManufacturingProcess25 highest among these. Process variables dominate the CUBE list slightly, accounting for four of the six predictors with nonzero importance. Another interesting difference is that the CUBE model assigns nonzero importance to only six predictors; predictors such as BiologicalMaterial04 and BiologicalMaterial05, which retain nontrivial importance in the PLS and KNN models, are not used by the CUBE model at all.

Exercise 8.7.c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Approach

Decision trees from single-tree models can be plotted using the base R plot() function on the finalModel attribute of the tuned model. Plotting decision trees from CIT models presents no issues, but plotting decision trees from CART models sometimes results in blank nodes. Wrapping the finalModel attribute of the tuned CART model in the as.party() function from the partykit package resolves this issue. To avoid overlapping text that renders the tree illegible, the font size is reduced using the gp = gpar() argument.

Results

plot(tune4$finalModel)

plot(as.party(tune3$finalModel), gp=gpar(fontsize = 8))

Interpretation

The optimal single-tree model based on the RMSE, \(R^2\), and MAE resampling performance metrics is the CIT model. The decision tree produced for the CIT model shows a single split on ManufacturingProcess32 at a value of 0.111. Values at or below 0.111 are associated with lower yields and values above 0.111 with higher yields, as shown by the boxplot distributions in the terminal nodes. The visualization does help in understanding the relationship between ManufacturingProcess32 and the yield, but not between any other predictor and the yield, since there is only one split. The CART model tree is plotted for comparison. It has several branches and conveys the same information about ManufacturingProcess32, albeit at a different cutoff value of 0.2, while also providing a much greater understanding of the relationship between the biological and process predictors and the yield thanks to the additional branches.

References

https://rpubs.com/josezuniga/366918

https://rpubs.com/josezuniga/376348

https://rpubs.com/josezuniga/377753

http://appliedpredictivemodeling.com/

https://github.com/topepo/APM_Exercises

https://www.ncbi.nlm.nih.gov/pubmed/18620558

https://cran.r-project.org/web/packages/gbm/gbm.pdf

https://cran.r-project.org/web/packages/caret/caret.pdf

https://cran.r-project.org/web/packages/RWeka/RWeka.pdf

https://topepo.github.io/caret/train-models-by-tag.html#Bagging

https://github.com/topepo/caret/blob/master/RegressionTests/Code/M5.R

https://github.com/topepo/caret/blob/master/RegressionTests/Code/bag.R

http://appliedpredictivemodeling.com/blog/2014/11/12/solutions-on-github

https://github.com/topepo/caret/blob/master/RegressionTests/Code/ctree2.R

http://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/train

https://github.com/topepo/caret/blob/master/RegressionTests/Code/cforest.R

https://www.rdocumentation.org/packages/party/versions/1.2-4/topics/varimp

https://www.rdocumentation.org/packages/caret/versions/6.0-78/topics/varImp

https://cran.r-project.org/web/packages/HSAUR2/vignettes/Ch_recursive_partitioning.pdf

https://stackoverflow.com/questions/13751962/how-to-plot-a-large-ctree-to-avoid-overlapping-nodes