8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

They were occasionally used in splits, but their importance scores are near zero (several are negative), far below those of V1-V5, so the random forest did not make significant use of them.

b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change?
The importance score for V1 did change, dropping from 8.73 to 5.69, and it is no longer the highest (V4, at 7.05, now is). The new variable, duplicate1, picked up substantial importance (4.28), consistent with the correlated pair splitting importance between them.

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

What happens when you add another predictor that is also highly correlated with V1?

Adding a second predictor that is also highly correlated with V1 reduced V1's importance further (from 5.69 to 4.92), and V4 (7.05) remains the most important. The importance of duplicate1 also dropped (from 4.28 to 3.80), and the new duplicate2 received an importance of 1.88. Each additional correlated copy further dilutes the importance credited to V1.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The pattern is broadly the same: V4, V2, V1, and duplicate1 top both lists, in that order, with similar but not identical values. The correlated copies fare somewhat better under cforest: duplicate1 (5.28) is nearly as important as V1 (5.42), and duplicate2 (2.92) edges out V5 (1.88) for fifth place, whereas V5 was fifth in the traditional random forest. These are the default (unconditional) importances; a conditional version is sketched after the output below.

library(party)

model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varimp(model4)  # default: traditional (unconditional) importance
rfImp4
##          V1          V2          V3          V4          V5          V6 
##  5.42152778  5.79075433 -0.01425384  5.96294311  1.87652947 -0.06822257 
##          V7          V8          V9         V10  duplicate1  duplicate2 
##  0.07696942 -0.17067028  0.02155362 -0.13376134  5.27935126  2.92224516
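
The exercise also points to varimp's conditional argument. A minimal sketch of the modified measure from Strobl et al. (2007) follows (not run above; it is noticeably slower, and the values vary with the seed):

# conditional importance per Strobl et al. (2007): each predictor is
# permuted within a grid defined by its correlated peers, which deflates
# the scores of correlated copies such as duplicate1 and duplicate2
rfImp4cond <- varimp(model4, conditional = TRUE)
rfImp4cond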

d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

For the boosted trees, the top 5 (V4, V2, V1, duplicate1, V5) match the traditional random forest, in the same order. The importances are reported on a different scale: gbm's relative influence, which sums to 100.

For the Cubist model, five predictors share the same importance value (50): V1, V2, V4, V5, and duplicate1, the same five that topped the previous models. V3, which had modest importance elsewhere, receives zero here, as does duplicate2.

#boosted
library(gbm)

gbmModel <- gbm(y ~ ., 
                data = simulated, 
                distribution = "gaussian")

summary(gbmModel)

##                   var    rel.inf
## V4                 V4 30.8498158
## V2                 V2 20.9639117
## V1                 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5                 V5 11.0492470
## V3                 V3  8.5487055
## duplicate2 duplicate2  0.7137245
## V6                 V6  0.2010299
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
library(Cubist)

# column 11 is the response y; pass predictors and outcome separately
cubistModel <- cubist(simulated[, -11], simulated$y)

varImp(cubistModel)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate2       0

8.2

Use a simulation to show tree bias with different granularities.

The text (p182) states: “…these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”.

Creating a simulated dataset with 2 predictors:

Then fit a single tree and look at the importance. Although both x1 and x2 contribute to y, x1, with its much larger range of values, is given a far higher importance.

#create the data

set.seed(888)

x1 <- runif(500, 0, 1000) # range of 0 - 1000
x2 <- runif(500, 0, 10)  # range of 0 - 10

#make y based on both x1 & x2 + some noise
y <- 0.5 * x1 - 2 * x2 + rnorm(500, sd = 2)

sim <- data.frame(x1, x2, y)

head(sim)
##          x1       x2          y
## 1  25.50811 6.606107  -2.985639
## 2 346.68075 1.701234 168.820088
## 3  61.24983 8.801614  11.531905
## 4 683.75206 2.359146 336.659749
## 5 767.25377 3.315835 375.861369
## 6 104.10757 8.385882  36.251889
library(rpart)

Tree <- rpart(y ~ ., data = sim)

varImp(Tree, scale = FALSE)
##      Overall
## x1 5.2760359
## x2 0.2954787
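
A sharper check of the selection-bias claim is possible (a sketch of our own; the helper pick_first_split and its settings are assumptions, not from the text): give both predictors zero information but different granularity, and tally which one wins the root split.

# both predictors are pure noise, but x_fine has ~200 distinct values
# while x_coarse has only 4; the root split should favor x_fine
set.seed(888)
pick_first_split <- function() {
  x_fine <- runif(200)                          # ~200 distinct values
  x_coarse <- sample(1:4, 200, replace = TRUE)  # 4 distinct values
  y <- rnorm(200)                               # unrelated to both
  fit <- rpart(y ~ x_fine + x_coarse,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])                # variable at the root split
}
table(replicate(200, pick_first_split()))
# expectation: x_fine is chosen at the root far more often than x_coarse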

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

From the text (pg 206):
* “… constrain the learning process by employing regularization, or shrinkage…Instead of adding the predicted value for a sample to previous iteration’s predicted value, only a fraction of the current predicted value is added to the previous iteration’s predicted value. This fraction is commonly referred to as the learning rate and is parameterized by the symbol, λ. This parameter can take values between 0 and 1 and becomes another tuning parameter for the model.”
* “…randomly select a fraction of the training data. The residuals and models in the remaining steps of the current iteration are based only on the sample of data. The fraction of training data used, known as the bagging fraction, then becomes another tuning parameter for the model…”
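
To make this concrete, here is a sketch of the two Fig. 8.24 extremes on the solubility data (our own illustration, not the book's code; n.trees = 100 is an assumed value):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# left-hand panel: learning rate and bagging fraction both 0.1
gbmLeft <- gbm(Solubility ~ ., data = solTrain,
               distribution = "gaussian", n.trees = 100,
               shrinkage = 0.1, bag.fraction = 0.1)

# right-hand panel: both set to 0.9
gbmRight <- gbm(Solubility ~ ., data = solTrain,
                distribution = "gaussian", n.trees = 100,
                shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbmLeft, plotit = FALSE), 10)   # importance spread out
head(summary(gbmRight, plotit = FALSE), 10)  # importance concentrated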

a

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Since the model on the right uses a learning rate and bagging fraction of 0.9, each iteration adds 90% (almost all) of the current tree's predicted value to the previous iteration's prediction, and each tree is fit on 90% of the training data. Each step therefore strongly reinforces the prior step, and the trees see nearly identical samples, so importance ends up concentrated in a small number of predictors.
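
In symbols, with the book's learning rate \(\lambda\), each boosting iteration updates the prediction as

\[
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \lambda \, f_t(x_i),
\]

so with \(\lambda = 0.9\) nearly all of each new tree \(f_t\)'s prediction is kept, and each \(f_t\) is fit to residuals computed on a random 90% of the training samples.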

b

Which model do you think would be more predictive of other samples?

The model on the left, with the learning rate and bagging fraction of 0.1, spreads importance across a larger number of predictors and is less prone to overfitting, so it is more likely to be predictive of other samples.

c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

The text (pp. 206-207) shows that the value of the shrinkage (learning rate) parameter affects RMSE across all choices of tree depth and number of trees.

Increasing the interaction (tree) depth allows each tree to involve more predictors, which spreads importance further down the list and flattens the slope of the importance curve. It will affect both models in Fig. 8.24, but likely have a larger impact on the left one, where the learning rate is 0.1 and importance is already more evenly spread.
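
Continuing the sketch from the start of 8.3 (reusing the solTrain data frame defined there; depth 10 is an arbitrary choice of ours), the effect on the importance slope could be checked directly:

# deeper trees involve more predictors per tree, flattening the slope
gbmDeep <- gbm(Solubility ~ ., data = solTrain,
               distribution = "gaussian", n.trees = 100,
               interaction.depth = 10,
               shrinkage = 0.1, bag.fraction = 0.1)
head(summary(gbmDeep, plotit = FALSE), 10)  # compare against gbmLeft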

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Replicating the preprocessing used in 6.3 and 7.5:

library(AppliedPredictiveModeling) 
data(ChemicalManufacturingProcess)

cmp <- preProcess(ChemicalManufacturingProcess,
                  method = "knnImpute")

cmp <- predict(cmp, ChemicalManufacturingProcess)

#split
set.seed(888)

train_rw2 <- createDataPartition(cmp$Yield,
                                 p = 0.8,
                                 list = FALSE)

train_cmp <- cmp[train_rw2,-1]
train_yld <- cmp$Yield[train_rw2]

test_cmp <- cmp[-train_rw2,-1]
test_yld <- cmp$Yield[-train_rw2]

# drop near-zero-variance predictors identified on the training set
nzv <- nearZeroVar(train_cmp)
train_cmp2 <- train_cmp[, -nzv]
test_cmp2 <- test_cmp[, -nzv]

a

Which tree-based regression model gives the optimal resampling and test set performance?

Tried Random Forest, Boosted Trees, and Cubist, with the following results:

Random Forest
* The \(R^2\) tells us that about 64% of the variation in the test data can be predicted using this model
* RMSE, the square root of the average squared difference between predictions and observations on the test data, is 0.544

GBM
* The \(R^2\) tells us that about 66% of the variation in the test data can be predicted using this model
* The RMSE on the test data is 0.523

Cubist
* The \(R^2\) tells us that about 68% of the variation in the test data can be predicted using this model
* The RMSE on the test data is 0.528

Cubist has the best overall results, with the largest \(R^2\) and the smallest MAE; its RMSE (0.528) is only marginally higher than GBM's (0.523).

Random Forest

rf <- train(train_cmp2, train_yld,
            method = 'rf',
            importance = TRUE)

rfPred <- predict(rf, newdata= test_cmp2)
postResample(pred= rfPred, obs=test_yld)
##      RMSE  Rsquared       MAE 
## 0.5443417 0.6355169 0.4227794

Boosted Trees

gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)


gbm <- train(train_cmp2, train_yld,
            method = 'gbm',
            tuneGrid = gbmGrid,
            verbose = FALSE)

gbmPred <- predict(gbm, newdata= test_cmp2)
postResample(pred= gbmPred, obs=test_yld)
##      RMSE  Rsquared       MAE 
## 0.5228054 0.6637246 0.4052370

Cubist

cube <- train(train_cmp2, train_yld,
            method = 'cubist',
            verbose = FALSE)

cubePred <- predict(cube, newdata= test_cmp2)
postResample(pred= cubePred, obs=test_yld)
##      RMSE  Rsquared       MAE 
## 0.5277177 0.6755219 0.3931756
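
Since the question also asks about resampling performance, the three fits can be compared on their resampled RMSE and \(R^2\) as well (a sketch; the models above used caret's default bootstrap resampling, so a strictly fair comparison would refit them with a shared trainControl and fixed resampling indices):

resamps <- resamples(list(RF = rf, GBM = gbm, Cubist = cube))
summary(resamps)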

b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

See the list of predictors below. There are 8 Manufacturing and 2 Biological predictors in the top 10 (with ManufacturingProcess04 tied with BiologicalMaterial02 for the 10th spot), so Manufacturing ‘dominates’.

Comparing these results to those from the linear and nonlinear models:

  • ManufacturingProcess (MP) 32 had the highest importance in the linear model and was second in the nonlinear one. It is again the most important here.
  • MP 13, which was 2nd in the linear model and 1st in the nonlinear one, moves down to 6th here (tied with MP 29).
  • MP 17 and MP 09 are also in the top 10 for the linear, nonlinear, and tree-based models. But MP 36, which was in the top 10 for both the linear and nonlinear models, is not (in fact it is not even in the top 20).
  • MP 25 shows importance with trees but was not in the top 10 for either the linear or nonlinear model.
  • The ratio of Manufacturing to Biological in the top 10 was 7:3 for the linear model, 6:4 for the nonlinear one, and 8:2 for trees, so Manufacturing was more ‘dominant’ in each, but by how much varied.
varImp(cube)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  75.000
## ManufacturingProcess09  55.682
## ManufacturingProcess01  35.227
## BiologicalMaterial12    27.273
## ManufacturingProcess13  23.864
## ManufacturingProcess29  23.864
## ManufacturingProcess33  20.455
## ManufacturingProcess25  18.182
## BiologicalMaterial02    15.909
## ManufacturingProcess04  15.909
## BiologicalMaterial03    10.227
## ManufacturingProcess26   6.818
## BiologicalMaterial06     6.818
## ManufacturingProcess31   6.818
## ManufacturingProcess14   5.682
## ManufacturingProcess45   5.682
## ManufacturingProcess27   5.682
## ManufacturingProcess28   5.682
## BiologicalMaterial04     4.545
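
A quick tally of the top 10 by predictor type (a sketch using the fit above; because of the tie at the 10th position, the split can differ by run):

imp <- varImp(cube)$importance
top10 <- rownames(imp)[order(-imp$Overall)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Manufacturing"))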

c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

From this tree plot, we can see:
* 17% of the samples (the 8th and 9th leaves, at 7% and 10%) are predicted without any BiologicalMaterial (BM) predictor; these leaves are reached through splits on ManufacturingProcess (MP) 32, then MP 31, then MP 04.
* 58% of the samples (the right side of the tree, leaves 6-12) involve only one BM predictor (BM 12 for leaves 6 and 7, or BM 03 for leaves 9-12).
* Only the leftmost 3 leaves (21% in total) use more BM than MP predictors (splitting on MP 32, BM 12, BM 05, and BM 09).

This clearly shows the dominance of the MP predictors.

#since we are going to plot this, let's rename relevant fields
library(dplyr)

train_cmp2 <- train_cmp2 |> 
  rename(MP32 = ManufacturingProcess32,
         MP31 = ManufacturingProcess31,
         MP17 = ManufacturingProcess17,
         MP04 = ManufacturingProcess04,
         BM12 = BiologicalMaterial12,
         BM05 = BiologicalMaterial05,
         BM03 = BiologicalMaterial03,
         BM09 = BiologicalMaterial09,
         BM10 = BiologicalMaterial10)

set.seed(888)
tree2 <- train(train_cmp2, train_yld,
               method = "rpart2",
               tuneLength = 10,
               trControl = trainControl(method = "cv"))

library(rpart.plot)

rpart.plot(tree2$finalModel)
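
The question also asks for the distribution of yield in the terminal nodes; converting the rpart fit to a party object plots a boxplot of the response in each leaf (a sketch, assuming the partykit package is available):

library(partykit)
plot(as.party(tree2$finalModel))  # boxplots of yield per terminal node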