8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
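
A minimal sketch of the fit this answer refers to, following the exercise text (ntree = 1000 and unscaled permutation importance are the book's settings; my actual call may have differed slightly):

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)   # permutation importance, unscaled
rfImp1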

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6–V10 were near zero, so the random forest made little use of the uninformative predictors.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9328125

Fit another random forest model to these data. Did the importance score for V1 change?
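
A sketch of the refit with duplicate1 included (same assumed settings as before):

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2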

Yes. V1’s importance dropped noticeably, with much of it shifting to the correlated duplicate1.

What happens when you add another predictor that is also highly correlated with V1?
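
For example (a sketch; duplicate2 is a hypothetical name, built the same way as duplicate1):

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)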

Each additional correlated predictor siphons more importance away from V1. Because the model sometimes randomly chooses a duplicate instead of V1 at splits where it previously used V1, the importance that was concentrated in V1 gets spread across V1 and its correlated copies.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
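
A sketch of the conditional inference forest fit (ntree = 1000 via cforest_unbiased and the seed are assumptions; the conditional importance calculation can be slow):

library(party)
set.seed(200)
cfModel <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 1000))
varimp(cfModel, conditional = FALSE)   # traditional permutation importance
varimp(cfModel, conditional = TRUE)    # Strobl et al. (2007) conditional importance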

The standard method (conditional = FALSE) shows the same pattern as the previous model: V1 shares its importance with the added, correlated predictor. V4 was also more important in the cforest model, although some of that may simply be randomness; even with the same seed the two models almost certainly split differently, and I may not have matched every default parameter between them.

When using the modified (conditional) method, the unimportant variables remain unimportant, but the correlated yet informative pair (V1 and duplicate1) becomes less important relative to the other informative predictors (V2, V4, and V5). The dilution of V1’s importance still shows in how it is shared with duplicate1, but their combined importance is now less than that of either V2 or V4.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted, without the correlated feature added:
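
A sketch of the boosted fit on the original ten predictors, dropping any duplicate columns (the gbm settings here are assumptions, not necessarily what I tuned):

library(gbm)
simOrig <- simulated[, !grepl("^duplicate", names(simulated))]
set.seed(200)
gbmOrig <- gbm(y ~ ., data = simOrig, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
summary(gbmOrig, plotit = FALSE)   # relative influence of each predictor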

Without the duplicate feature added, the unimportant features stay unimportant, but the relative importance of the informative predictors differs from the randomForest model: V5 is slightly more important now, and V3 is much more important. This is the opposite of what might be expected of a boosted model, which should concentrate variable importance more heavily in the strongest predictors. Let’s see how the relative importances change when the duplicate feature is added.
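
Refitting with duplicate1 added back in (a sketch reusing simOrig and the same assumed settings):

simDup <- cbind(simOrig, duplicate1 = simulated$duplicate1)
set.seed(200)
gbmDup <- gbm(y ~ ., data = simDup, distribution = "gaussian",
              n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
summary(gbmDup, plotit = FALSE)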

The correlated feature took on an importance roughly equal in magnitude to the reduction in V1’s importance. There were also some other relative changes, such as the least important of the 5 “real” predictors, V3, becoming even more important (“even more” because it was already much more important in the first boosted model than in the random forests earlier).

Cubist:

First without the duplicate feature:
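
A sketch of the Cubist fit described below (committees = 100 as stated; reading importance from the fitted object's usage table is an assumption about how I measured it):

library(Cubist)
cubOrig <- cubist(x = simOrig[, names(simOrig) != "y"], y = simOrig$y, committees = 100)
cubOrig$usage   # % of rules/models in which each predictor appears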

Using a committee of 100 models, and measuring importance only by how often a predictor is used in a split, V1, V2, and V4 are the most important, as with previous models. But V6 has somehow become important, and V3 is more important than in previous models.

Now with the duplicate feature:
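
And refit with duplicate1 included (same sketch settings, using simDup from above):

cubDup <- cubist(x = simDup[, names(simDup) != "y"], y = simDup$y, committees = 100)
cubDup$usage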

In the Cubist model, at least as I built it, the correlated/duplicate feature takes relative importance away from all of the features, not just the one it is correlated with. It isn’t used much in any case; even an uninformative feature like V6 ends up more important than it.

The importances discussed here don’t seem as informative as those from the previous models, and the book suggests that Cubist’s variable importances are harder to make sense of.

8.2. Use a simulation to show tree bias with different granularities.

I’ll simulate \(y = \cos(x)\) using a Taylor polynomial plus some added noise. One noise term is a small random normal term added to the response (y), and another is a larger random normal term used as a predictor. With a small enough training sample, say 100 observations, a tree model may be “tempted” to split on the higher powers of the polynomial, or perhaps even on the random noise term.

By using a range of complexity penalties, we can observe how a tree’s bias responds, as lower penalties allow the model to add higher-order polynomial terms or more segmentation of the lower-order polynomial terms.

set.seed(624)
N = 100
noise1 = rnorm(N) / pi   # small noise added to the response
set.seed(200)
noise2 = rnorm(N)        # larger noise used as a predictor
set.seed(624)
x = runif(N, -5, 5)
# Taylor-series terms of cos(x), excluding the constant: (-1)^k x^(2k) / (2k)!
x1 = x ^ 2 / -factorial(2)
x2 = x ^ 4 / factorial(4)
x3 = x ^ 6 / -factorial(6)
x4 = x ^ 8 / factorial(8)
x5 = x ^ 10 / -factorial(10)
x6 = x ^ 12 / factorial(12)
y = cos(x) + noise1
sim = data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, x6=x6, n=noise2)
head(sim)
##             y         x1        x2          x3         x4           x5
## 1  0.11525640  -9.416476 14.778336  -9.2773229 3.11998879 -0.652873309
## 2 -1.41540594  -2.696188  1.211572  -0.2177750 0.02097009 -0.001256429
## 3 -0.06175457 -11.972435 23.889866 -19.0679913 8.15322445 -2.169198869
## 4 -0.74108932  -3.709037  2.292826  -0.5669450 0.07510071 -0.006190029
## 5 -0.91194892  -3.814257  2.424760  -0.6165771 0.08399227 -0.007119292
## 6  0.08650217 -11.147933 20.712734 -15.3936111 6.12881931 -1.518303676
##             x6           n
## 1 9.314797e-02  0.08475635
## 2 5.132681e-05  0.22646034
## 3 3.934938e-01  0.43255650
## 4 3.478643e-04  0.55806524
## 5 4.114365e-04  0.05975527
## 6 2.564537e-01 -0.11464087
plot(x,y, main = "Approximation of Cosine")

First with a penalty of 0.05, which will be high enough to make the tree use only the most crucial predictors, namely \(x_1 = \frac{x^2}{-2!}\) and \(x_2 = \frac{x^4}{4!}\).

library(rpart)
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
set.seed(624)
rp1 = rpart(y ~ ., sim, control = rpart.control(cp = .05))
plot(as.party(rp1))

Since we’re focused on tree bias in this question, it’s worth noting that in the leftmost leaf node, where 52 of the 100 samples end up, the IQR runs from about -1.15 to -0.65, and the rest of the values range from about -1.75 to 0. All 52 of those samples will be predicted as the node mean, around -0.85.

Now with less of a penalty:

set.seed(624)
rp2 = rpart(y ~ ., sim, control = rpart.control(cp = .005))
plot(as.party(rp2))

Reducing the penalty from .05 to .005 grew the tree by 2 levels. The model still uses only the 2 most important predictors, although it has segmented their values into more specific ranges. The leftmost leaf node now makes about the same prediction (around -.9, vs -.85 in the previous model), but it contains only 38 samples rather than 52, allowing the tree to be less biased in its predictions. The range of values there is smaller because the 14 samples that left the node moved to the next 2 leaves to the right: 7 of them are predicted slightly higher, in Node 6, and 7 considerably higher, in Node 7.

By extension, reducing the complexity penalty further should result in even less bias, by allowing the tree to add splits at less cost.

set.seed(624)
rp3 = rpart(y ~ ., sim, control = rpart.control(cp = 0.0005))
plot(as.party(rp3))

Now there are 2 more levels in the tree, both branching from the leftmost node of the previous model. The 38 samples there got split twice, but the problem is that both splits are on the (ultimately useless) noise term. This is not so much a reduction in bias as a fault in the model fit: we don’t want it to fit the noise, and we’d rather see \(x_3\) or \(x_4\) in there as splitting predictors. As it stands, this model may have reduced its fitted RMSE by adding those noise splits, but they won’t provide any predictive help over the long haul. Where we do see a slight but helpful reduction in bias is on the far right of the tree, where the 20 samples with the highest responses get split into leaves of 7 and 13. The improvement is very slight, because the predicted (average) values for the two leaves are almost the same.

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right randomly samples 9/10 of the training data for every new tree, so it follows the strongest predictors almost every time, concentrating importance in a small subset of them. The model on the left samples only 1/10 of the training observations for each new tree, so it is forced to rely on weaker signals more often and spreads importance across more predictors. The learning rate is also relatively huge in the model on the right, so each tree contributes a lot and, in effect, fewer trees drive the predictions, which again makes the model lean on a smaller subset of features. Essentially, it jumps to conclusions more quickly and uses the same rationale every time.
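
To see this concretely, here is a sketch of the two configurations on the solubility data, using gbm directly so the bagging fraction and learning rate can be set explicitly (n.trees = 100, interaction.depth = 7, and the seed are assumptions, since the figure's other settings aren't given):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLeft <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLeft, plotit = FALSE), 10)    # importance spread across predictors
head(summary(gbmRight, plotit = FALSE), 10)   # importance concentrated at the top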

(b) Which model do you think would be more predictive of other samples?

I’d guess the model on the left would be more predictive of new samples, since it learns more cautiously and is thus able to pick up on more patterns. It is less prone to overfitting than the one on the right.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

I believe the slope on the right would become more gradual, i.e. more like the one on the left, because deeper trees consider more variables. Presumably the model would stop splitting only on NumCarbon and find other variables that interact with it. For the model on the left, I don’t think the slope would change as much, although if anything it would become more gradual too. Once a tree has split on a feature, later trees and deeper splits have already fit much of the residual associated with it, making it less likely that the same feature keeps reducing the remaining error. But since this model only sees a random 10% of the samples at each iteration, it is already using different features throughout the process by design, so the slope of its variable importances probably wouldn’t change much. And within the 10% of samples each tree is given to fit, useful feature interactions that would matter beyond a stump are less likely to be present.
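
One way to check this, continuing the sketch above (interaction.depth = 20 is an arbitrary assumption for “deeper”):

set.seed(100)
gbmRightDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 20,
                    shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmRightDeep, plotit = FALSE), 10)   # compare the slope to gbmRight's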

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(dplyr)   # for %>% and select()
library(caret)   # for train(), varImp(), defaultSummary()
data(ChemicalManufacturingProcess)

# remove one feature that has near-zero variance
prepro = ChemicalManufacturingProcess %>% select(-c(BiologicalMaterial07))
# turn the 0's in row 108 into NA's, then impute medians for all NA's in the dataset
prepro[108, c(37:39,41:43)] = NA   # ManufacturingProcess 25-27 and 29-31
for (feat in names(prepro)) {
  med = median(prepro[,feat], na.rm = T)
  prepro[feat][is.na(prepro[feat])] = med
}
# shorten the feature names
shorten = function(n) paste0(substr(n, 1, 1), substr(n, nchar(n)-1, nchar(n)))
names(prepro) = sapply(names(prepro), shorten)

# split train-test
set.seed(624)
trains = sample(1:176, 136)  # randomly choose 136 and test with the other 40
tests = setdiff(1:176, trains)
trainDF = prepro[trains,]
testDF = prepro[tests,]
trainX = trainDF %>% select(-c(Yld))
trainY = trainDF$Yld
testX = testDF %>% select(-c(Yld))
testY = testDF$Yld

I didn’t expect standard-scaling the predictors to have any effect on these trees, since it doesn’t change where a tree model can split on a feature, i.e. which samples end up on which side of any split. But the first model I tried with the scaled data performed slightly better than with the non-scaled data, so I’ll use the scaled data in every model. Scaling probably introduced tiny floating-point differences that let the tree separate two samples that actually had identical measurements, so it’s potentially the wrong thing to do.

# standard scaling
stdX = c()
nRow = dim(trainX)[1]
nCol = dim(trainX)[2]
testRow = dim(testX)[1]
testCol = dim(testX)[2]

for (c in 1:nCol) {stdX = c(stdX, sd(trainX[,c]))}
meanX = colMeans(trainX)
scaleX = trainX - matrix(replicate(nRow, t(meanX)), nrow = nRow, byrow = T)
scaleX = scaleX / matrix(replicate(nRow, t(stdX)), nrow = nRow, byrow = T)
# use the training set stats to scale the test set data
scaletestX = testX - matrix(replicate(testRow, t(meanX)), nrow = testRow, byrow = T)
scaletestX = scaletestX / matrix(replicate(testRow, t(stdX)), nrow = testRow, byrow = T)

Boosting

gbmGrid = expand.grid(.interaction.depth = 3:4,
                      .n.trees = c(500, 1000),
                      .shrinkage = c(.01),
                      .n.minobsinnode = c(3, 7))
set.seed(624)
gbmTune = train(scaleX, trainY, method = 'gbm', tuneGrid = gbmGrid, distribution = 'gaussian', verbose = F)
gbmTune
## Stochastic Gradient Boosting 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 136, 136, 136, 136, 136, 136, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  n.trees  RMSE      Rsquared   MAE      
##   3                  3                500     1.189240  0.6221164  0.9095980
##   3                  3               1000     1.180210  0.6258785  0.9008749
##   3                  7                500     1.214103  0.6049161  0.9318986
##   3                  7               1000     1.206106  0.6114110  0.9255390
##   4                  3                500     1.176529  0.6294794  0.8996267
##   4                  3               1000     1.165992  0.6345757  0.8894680
##   4                  7                500     1.204181  0.6118271  0.9230287
##   4                  7               1000     1.193712  0.6191206  0.9162739
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  4, shrinkage = 0.01 and n.minobsinnode = 3.

The grid search settled on the most flexible level of every parameter (the most trees, the deepest interactions, and the smallest minimum node size), which is also the most prone to overfitting, so it’s worth checking a couple of settings with cross-validation.

cvGrid = expand.grid(.interaction.depth = c(4, 6, 8),
                     .n.trees = c(1000),
                     .shrinkage = c(.01),
                     .n.minobsinnode = c(3, 6))
set.seed(624)
gbmCV = train(scaleX, trainY, method = 'gbm', tuneGrid = cvGrid, distribution = 'gaussian', 
              trControl = trainControl(method = 'cv'), verbose = F)
gbmCV
## Stochastic Gradient Boosting 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  RMSE      Rsquared   MAE      
##   4                  3               1.058300  0.6920817  0.8092401
##   4                  6               1.113570  0.6605014  0.8510027
##   6                  3               1.035744  0.6989872  0.8011331
##   6                  6               1.100065  0.6706249  0.8357560
##   8                  3               1.012897  0.7158286  0.7815668
##   8                  6               1.088453  0.6786011  0.8333103
## 
## Tuning parameter 'n.trees' was held constant at a value of 1000
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.01
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  8, shrinkage = 0.01 and n.minobsinnode = 3.

The cross-validation scores continued to get better as more depth and smaller leaves were used, so overfitting shouldn’t be a big issue.

gbmPreds = predict(gbmCV, newdata = scaletestX)
gbmResults = data.frame(obs = testY, pred = gbmPreds)
defaultSummary(gbmResults)
##      RMSE  Rsquared       MAE 
## 0.9724592 0.6706371 0.7372108

Those results are essentially the same as the best non-linear model found in the last assignment, an SVM with an RBF kernel.

Cubist

cubTune = train(scaleX, trainY, method = 'cubist')
cubTune
## Cubist 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 136, 136, 136, 136, 136, 136, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.756264  0.3440998  1.2857749
##    1          5          1.736347  0.3599423  1.2674817
##    1          9          1.736759  0.3572717  1.2686717
##   10          0          1.297482  0.5460305  0.9892586
##   10          5          1.282140  0.5572198  0.9699638
##   10          9          1.289095  0.5523544  0.9785725
##   20          0          1.245505  0.5767597  0.9542021
##   20          5          1.226297  0.5898758  0.9289577
##   20          9          1.234211  0.5840923  0.9404861
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.

Those won’t predict as well as the boosted trees.
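
A quick sketch to verify against the held-out set, using the same objects as above:

cubPreds = predict(cubTune, newdata = scaletestX)
defaultSummary(data.frame(obs = testY, pred = cubPreds))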

Random Forests

rfGrid = expand.grid(.mtry = c(15, 20, 25))
set.seed(624)
rfTune = train(scaleX, trainY, method = 'rf', ntree = 1500,
               tuneGrid = rfGrid, trControl = trainControl(method = 'cv'))
rfTune
## Random Forest 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##   15    1.129738  0.6832055  0.8843001
##   20    1.121964  0.6817494  0.8715418
##   25    1.119076  0.6778207  0.8632111
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 25.
rfPreds = predict(rfTune, newdata = scaletestX)
rfResults = data.frame(obs = testY, pred = rfPreds)
defaultSummary(rfResults)
##      RMSE  Rsquared       MAE 
## 1.0565397 0.6257032 0.8762222

Not as good as boosting or SVM. The drop in Rsquared from resampling to the test set suggests some overfitting, most likely from the relatively high mtry = 25 (adding more trees, e.g. ntree = 1500, generally does not make a random forest overfit), despite the control offered by the CV.

Model Trees

library(RWeka)
set.seed(624)
m5tune = train(scaleX, trainY, method='M5', trControl = trainControl(method='cv'),
               control = Weka_control(M = 7))
m5tune
## Model Tree 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ... 
## Resampling results across tuning parameters:
## 
##   pruned  smoothed  rules  RMSE      Rsquared   MAE      
##   Yes     Yes       Yes    1.173784  0.6295134  0.9518252
##   Yes     Yes       No     1.171491  0.6325991  0.9416163
##   Yes     No        Yes    1.235069  0.6051261  1.0103112
##   Yes     No        No     1.289679  0.5702320  1.0395198
##   No      Yes       Yes    1.259056  0.5818645  0.9484234
##   No      Yes       No     1.221203  0.6023600  0.9324067
##   No      No        Yes    1.543762  0.4239000  1.1611382
##   No      No        No     1.617326  0.4136207  1.2365764
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were pruned = Yes, smoothed = Yes and
##  rules = No.

These won’t be as good either.

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Gradient boosted trees worked best, and produced test results very similar to those of SVM.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(gbmCV, numTrees = 1000)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##     Overall
## M32 100.000
## M13  21.542
## M09  19.432
## B12  17.220
## M17  14.586
## M31  12.600
## B03  12.112
## M06  11.464
## B09   9.805
## B05   8.240
## M11   6.727
## M21   6.347
## B06   6.213
## B02   5.874
## B08   5.367
## M05   4.770
## B11   4.768
## B01   4.608
## M39   4.328
## M43   4.086

Compared to the SVM importances:

svmRBF = data.frame(feat = c('M13','M32','M17','B06','M09','B03','M36','M31','B02','B12'),
                    imp = c(100.00000, 96.49254, 88.24801, 83.62914, 79.40360, 70.94889,
                            68.71461, 64.92377, 62.61581, 59.61470))
svmRBF
##    feat       imp
## 1   M13 100.00000
## 2   M32  96.49254
## 3   M17  88.24801
## 4   B06  83.62914
## 5   M09  79.40360
## 6   B03  70.94889
## 7   M36  68.71461
## 8   M31  64.92377
## 9   B02  62.61581
## 10  B12  59.61470

These are mostly the same names, just in a different order. The boosted trees’ important predictors were more similar to the SVM’s than either set was to the best linear model I found, for what that’s worth. The biggest difference between the tree and SVM importances is that the trees placed a much greater weight on the top feature, M32, whereas the SVM model relied on a broader, or at least more balanced, range of features. But the different ways these models measure importance make this something of an apples-to-oranges comparison.

The ratio of biological to manufacturing features is similar, at any rate, as are their relative importances in the top 10 (4 Bio, 6 Mfg, in almost exactly the same ranks). The 2 Bio features in the SVM top 10 that are replaced by other Bio features in the GBM turn up at spots 13 and 14 in the GBM list, so everything is very similar. Only M36, at #7 for the SVM, has disappeared from the GBM top 20. It had a high, negative correlation with the predictor that dominated the GBM importances, M32:

cor(scaleX[,'M32'], scaleX[,'M36'])
## [1] -0.7967635

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

set.seed(624)
rpTune = train(scaleX, trainY, method = 'rpart2', tuneLength = 10, 
               trControl = trainControl(method='cv'))
rpTune
## CART 
## 
## 136 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        1.484363  0.4030007  1.151774
##    2        1.524689  0.3890898  1.176829
##    3        1.459546  0.4407655  1.156763
##    4        1.396589  0.4844039  1.102704
##    5        1.465365  0.4595752  1.111515
##    6        1.409894  0.4870219  1.048742
##    7        1.417333  0.4857024  1.044813
##    9        1.472663  0.4685606  1.072259
##   10        1.475604  0.4673763  1.070339
##   11        1.475604  0.4673763  1.070339
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 4.
set.seed(624)
rp4 = rpart(Yld ~ ., prepro, control = rpart.control(cp = 0.001))
plot(as.party(rp4))

The goal of the experiment was to figure out how to increase the response, so the leaves on the right, which descend from higher values of M32 and have higher boxplot values, show that increasing that process measurement is associated with success. Even on the left half of the tree, where M32 values are smaller, there is a leaf of higher responses (Node 20) that looks like it belongs on the right half of the tree, and sure enough, it contains the highest of the lower M32 values. Another strongly recurring theme is that the biological predictor B09 splits cleanly into high vs. low responses (i.e. it is an informative predictor): at the 4 splits on B09 in this tree, the samples with lower B09 values have noticeably higher responses than those with higher B09 values.

The way this experiment was designed, however, makes it impossible to say where the causation lies; we only see where collinear predictors line up with the response. The hardest part of understanding what causes what is that M32 is, for some reason, highly correlated with the biological features, so we never get to see how samples with high biological values but low M32 values behave. In that sense, B09 is actually a more useful predictor: it doesn’t correlate nearly as highly with the response, yet it interacts meaningfully with other predictors to provide a lot of predictive power.

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(prepro), type = "upper", diag = F) -> r   # assign the result to suppress some output