8.1 Recreate the simulated data from Exercise 7.2

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y~., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

No; the importance scores for V6-V10 are near zero (several are even negative), far below those of the informative predictors V1-V5.
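
For a quick visual check, randomForest's built-in importance plot can be drawn as well; it shows both the permutation-based (%IncMSE) and node-purity measures for model1:

# Both importance measures computed when importance = TRUE:
# permutation-based %IncMSE and node-purity IncNodePurity
varImpPlot(model1, main = "Exercise 8.1(a): variable importance")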

b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

The score for V1 drops with the addition of the highly correlated predictor duplicate1, going from about 8.73 to 5.69, while duplicate1 itself picks up a score of about 4.28. If we add another predictor that is also highly correlated with V1, its score will likely drop further.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

Indeed, the score for V1 dropped further, to about 4.92, and duplicate1’s score also dropped (to about 3.80) as duplicate2 absorbed part of the importance.

c) Use the cforest function in the party package to fit a random forest model using the conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varimp(model4)
rfImp4
##          V1          V2          V3          V4          V5          V6 
##  4.11910478  5.81149273  0.01133118  7.16991462  1.62054023 -0.01818532 
##          V7          V8          V9         V10  duplicate1  duplicate2 
## -0.02795336 -0.03663807 -0.04360946 -0.03454443  4.62957645  1.08104915

They show a broadly similar pattern, but to a lesser degree: the informative predictors still dominate and V6-V10 are near zero, yet the score for V1 (about 4.12) is actually a bit lower than in model3, with duplicate1 again drawing off much of its importance.
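
The question also points to the conditional importance of Strobl et al. (2007); a sketch of that call is shown below (not run here, and it can be slow even on a data set of this size):

# Conditional permutation importance (Strobl et al. 2007): permutes each
# predictor within strata defined by correlated predictors, which should
# down-weight duplicate1 and duplicate2 relative to the unconditional scores
rfImp4Cond <- varimp(model4, conditional = TRUE)
sort(rfImp4Cond, decreasing = TRUE)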

d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
## Loaded gbm 2.1.5
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##                   var    rel.inf
## V4                 V4 29.1891921
## V1                 V1 24.2202771
## V2                 V2 23.0396157
## V5                 V5 10.6261851
## V3                 V3  7.3239995
## duplicate1 duplicate1  5.0541842
## V6                 V6  0.3612731
## duplicate2 duplicate2  0.1852732
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
library(Cubist)
cModel <- cubist(simulated[, -11], simulated[, 11])
varImp(cModel)
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0
## duplicate2       0

The boosted tree follows the same pattern as the other models: the informative predictors and duplicate1 dominate, while V6-V10 receive little or no influence. The Cubist model is harder to compare directly because its scores are usage-based and look normalized, but it flags essentially the same predictors as important (V1, V2, V4, V5, and duplicate1) and ignores the rest.
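
Since the Cubist scores are usage percentages rather than error-based measures, the usage table behind them can be inspected directly; this assumes the cModel object fit above, and caret’s varImp() for Cubist is essentially an average of the two usage columns:

# How often each predictor appears in rule conditions versus in the
# terminal linear models of the Cubist fit
cModel$usage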

8.2 Use a simulation to show tree bias with different granularities.

set.seed(42)
X1 <- sample(1:10, 100, replace = T)
X2 <- rnorm(100)
X3 <- rnorm(100, mean = 0, sd = 4)
X4 <- sample(1:10/10, 100, replace = T)
Y <- X1 + rnorm(100, mean = 0, sd = .5)

df <- data.frame(X1, X2, X3, X4, Y)

Since Y is generated from X1 plus noise, X1 should receive the most importance even though it has the lowest granularity (only 10 distinct integer values). The question of bias is whether the finer-grained noise predictors (X2 and X3) pick up more spurious importance than the coarse one (X4).
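
As a quick check on the granularities involved, the number of distinct values per predictor can be counted (a small sketch using the df constructed above):

# Distinct values per predictor: a rough proxy for granularity
sapply(df[, c("X1", "X2", "X3", "X4")], function(col) length(unique(col)))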

library(rpart)
simTree <- rpart(Y~., data = df)
varImp(simTree)
##      Overall
## X1 3.3176171
## X2 0.3638746
## X3 0.5578960
## X4 0.2486994

Indeed, X1 has far higher importance than the others because Y was built from it. Among the pure-noise predictors, the continuous X2 and X3 receive somewhat more importance than the coarser X4, which is the granularity bias the exercise asks about.

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has a larger bagging fraction and learning rate: each tree is built from most of the dataset, and each tree’s contribution to the final prediction is large. Because every tree sees nearly the same data and contributes heavily, the same few strong predictors get selected over and over, so importance concentrates on them; with both parameters at 0.1, the trees are built on different subsamples and each contributes only a little, which spreads importance across more predictors. It is tempting to read this as the right model having a better sense of which predictors are truly important, but the book itself notes “that small[er] values of the learning parameter (< 0.01) work best”, so the concentration is more an artifact of the aggressive settings than a sharper view of importance.

b) Which model do you think would be more predictive of other samples?

Without testing either model directly, the model on the left would likely be more predictive of other samples: there is a strong chance that the model on the right overfits the data, since it leans on only a handful of predictors and disregards many that the left model considers important.

c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig 8.24?

Interaction depth is essentially tree depth, so one would expect the importance slopes to grow even more drastic. The right model, which already builds each tree on most of the data, would only reinforce its earlier decisions with greater tree depth, and the same would hold, to a lesser extent, for the left model.
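
As a rough illustration, the right-hand settings from Fig. 8.24 could be refit on the solubility data with a much deeper interaction depth and the relative influences compared; the number of trees here (100) and the use of solTrainXtrans / solTrainY from the AppliedPredictiveModeling package are assumptions rather than the figure’s exact setup:

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans / solTrainY

# Bagging fraction and learning rate as in the right-hand plot (0.9 / 0.9),
# but with interaction.depth raised from the default of 1 to 10
deepGbm <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 10,
                   shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(deepGbm, plotit = FALSE), 10)  # top-10 relative influences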

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Using the same data imputation, data splitting, and pre-processing steps as before, train several tree-based models:

library(AppliedPredictiveModeling)
library(impute)
library(ipred)

data("ChemicalManufacturingProcess")
cmp <- impute.knn(as.matrix(ChemicalManufacturingProcess))
cmp <- as.data.frame(cmp$data)

smp <- floor(0.75 * nrow(cmp))

# Use a single index for both the predictors and the response so the
# training rows of X and y actually correspond to one another
x <- sample(seq_len(nrow(cmp)), size = smp)

cmpTrain <- cmp[x, -1]
cmpTest <- cmp[-x, -1]
yTrain <- cmp[x, 1]   # Yield is the first column
yTest <- cmp[-x, 1]

# Bagged Trees
cmpBT <- ipredbagg(yTrain, cmpTrain)

# Random Forest
cmpRF <- randomForest(cmpTrain, yTrain)

# Boosted Tree
cmpGBM <- gbm.fit(cmpTrain, yTrain, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        3.4844             nan     0.0010    0.0001
##      2        3.4838             nan     0.0010   -0.0000
##      3        3.4838             nan     0.0010   -0.0005
##      4        3.4834             nan     0.0010   -0.0002
##      5        3.4831             nan     0.0010   -0.0005
##      6        3.4826             nan     0.0010   -0.0001
##      7        3.4822             nan     0.0010    0.0003
##      8        3.4820             nan     0.0010   -0.0002
##      9        3.4815             nan     0.0010   -0.0002
##     10        3.4814             nan     0.0010   -0.0007
##     20        3.4767             nan     0.0010    0.0002
##     40        3.4674             nan     0.0010    0.0002
##     60        3.4589             nan     0.0010    0.0000
##     80        3.4512             nan     0.0010    0.0000
##    100        3.4426             nan     0.0010   -0.0004
# Cubist
cmpCub <- cubist(cmpTrain, yTrain)

a) Which tree-based regression model gives the optimal resampling and test set performance?

cmpBT.pred <- predict(cmpBT, newdata = cmpTest)
cmpRF.pred <- predict(cmpRF, newdata = cmpTest)
cmpGBM.pred <- predict(cmpGBM, newdata = cmpTest,
                       n.trees = cmpGBM$n.trees)
cmpCub.pred <- predict(cmpCub, newdata = cmpTest)

postResample(cmpBT.pred, yTest)
##        RMSE    Rsquared         MAE 
## 1.855391440 0.001844473 1.577750438
postResample(cmpRF.pred, yTest)
##        RMSE    Rsquared         MAE 
## 1.855356831 0.002632157 1.587732136
postResample(cmpGBM.pred, yTest)
##        RMSE    Rsquared         MAE 
## 1.758457640 0.001630899 1.452252524
postResample(cmpCub.pred, yTest)
##        RMSE    Rsquared         MAE 
## 1.890693363 0.005350072 1.537437005

Based on the test-set results above, no model stands out: the boosted tree has the lowest RMSE and MAE, the Cubist model the highest \(R^2\), and all of the \(R^2\) values are very small. The random forest has neither the lowest RMSE nor the highest \(R^2\), but it sits in the middle on both measures, so it arguably offers the best compromise.
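
The comparison above uses only the single test split; a sketch of how a resampled estimate could be added for, say, the random forest with caret is below (the 10-fold CV and tuneLength = 5 are illustrative choices, not tuned ones):

# 10-fold cross-validated estimate for the random forest; caret was
# loaded earlier, and cmpTrain / yTrain are the training data from above
ctrl <- trainControl(method = "cv", number = 10)
cmpRF_cv <- train(x = cmpTrain, y = yTrain, method = "rf",
                  tuneLength = 5, trControl = ctrl)
cmpRF_cv$results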

qqnorm(cmpBT.pred, main = "Bagged")
qqline(cmpBT.pred)

qqnorm(cmpRF.pred, main = "Random Forest")
qqline(cmpRF.pred)

qqnorm(cmpGBM.pred, main = "Boosted")
qqline(cmpGBM.pred)

qqnorm(cmpCub.pred, main = "Cubist")
qqline(cmpCub.pred)

The normal Q-Q plots of the predictions tend to agree, with the random forest predictions following the reference line about as well as any of the others, save perhaps the Cubist model.
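
A more direct look at fit is a residual plot on the test set; a short sketch for the random forest predictions:

# Test-set residuals (observed minus predicted) for the random forest
rfResid <- yTest - cmpRF.pred
plot(cmpRF.pred, rfResid, xlab = "Predicted yield", ylab = "Residual",
     main = "Random forest test-set residuals")
abline(h = 0, lty = 2)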

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

cmpRF$importance
##                        IncNodePurity
## BiologicalMaterial01      9.09659840
## BiologicalMaterial02      5.52837630
## BiologicalMaterial03      6.59033877
## BiologicalMaterial04      9.06924320
## BiologicalMaterial05      6.52243630
## BiologicalMaterial06      6.28282125
## BiologicalMaterial07      0.09778131
## BiologicalMaterial08      6.13796946
## BiologicalMaterial09      8.93826202
## BiologicalMaterial10      8.70184021
## BiologicalMaterial11      5.78476367
## BiologicalMaterial12      7.51486626
## ManufacturingProcess01    6.75065464
## ManufacturingProcess02    4.55837864
## ManufacturingProcess03    4.24484920
## ManufacturingProcess04    8.37197485
## ManufacturingProcess05   16.16392014
## ManufacturingProcess06   13.67153994
## ManufacturingProcess07    2.24781856
## ManufacturingProcess08    1.25810497
## ManufacturingProcess09   13.18060219
## ManufacturingProcess10    8.93316473
## ManufacturingProcess11    8.15220580
## ManufacturingProcess12    0.44250416
## ManufacturingProcess13   27.24671173
## ManufacturingProcess14   10.13834102
## ManufacturingProcess15    9.56799779
## ManufacturingProcess16    9.70307869
## ManufacturingProcess17    9.55001138
## ManufacturingProcess18    6.65482687
## ManufacturingProcess19    6.61292835
## ManufacturingProcess20    8.09854293
## ManufacturingProcess21    6.12772689
## ManufacturingProcess22    6.35809946
## ManufacturingProcess23    4.32597074
## ManufacturingProcess24   10.48159891
## ManufacturingProcess25    6.72807779
## ManufacturingProcess26    5.82566678
## ManufacturingProcess27    7.88480245
## ManufacturingProcess28    2.75801357
## ManufacturingProcess29    4.67023377
## ManufacturingProcess30    5.01011088
## ManufacturingProcess31    7.53221053
## ManufacturingProcess32   14.88077737
## ManufacturingProcess33    4.98193170
## ManufacturingProcess34    1.84282354
## ManufacturingProcess35   36.53553664
## ManufacturingProcess36    7.28035732
## ManufacturingProcess37   13.00341253
## ManufacturingProcess38    3.91188278
## ManufacturingProcess39    8.13208298
## ManufacturingProcess40    1.09883529
## ManufacturingProcess41    1.01062441
## ManufacturingProcess42    7.01059978
## ManufacturingProcess43    5.08586389
## ManufacturingProcess44    2.26434939
## ManufacturingProcess45   10.03921195

If we sort the importance matrix directly, we lose the row labels, but we can match the values back up against the table above visually.

sort(cmpRF$importance, decreasing = TRUE)
##  [1] 36.53553664 27.24671173 16.16392014 14.88077737 13.67153994
##  [6] 13.18060219 13.00341253 10.48159891 10.13834102 10.03921195
## [11]  9.70307869  9.56799779  9.55001138  9.09659840  9.06924320
## [16]  8.93826202  8.93316473  8.70184021  8.37197485  8.15220580
## [21]  8.13208298  8.09854293  7.88480245  7.53221053  7.51486626
## [26]  7.28035732  7.01059978  6.75065464  6.72807779  6.65482687
## [31]  6.61292835  6.59033877  6.52243630  6.35809946  6.28282125
## [36]  6.13796946  6.12772689  5.82566678  5.78476367  5.52837630
## [41]  5.08586389  5.01011088  4.98193170  4.67023377  4.55837864
## [46]  4.32597074  4.24484920  3.91188278  2.75801357  2.26434939
## [51]  2.24781856  1.84282354  1.25810497  1.09883529  1.01062441
## [56]  0.44250416  0.09778131
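
A short sketch that keeps the row names while sorting, so the manual matching below can be double-checked (this uses the cmpRF fit from above):

# Extract the IncNodePurity column as a named vector, then sort;
# names are preserved, so no visual matching is needed
imp <- cmpRF$importance[, "IncNodePurity"]
head(sort(imp, decreasing = TRUE), 10)
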
  1. ManufacturingProcess35 = 36.53553664
  2. ManufacturingProcess13 = 27.24671173
  3. ManufacturingProcess05 = 16.16392014
  4. ManufacturingProcess32 = 14.88077737
  5. ManufacturingProcess06 = 13.67153994
  6. ManufacturingProcess09 = 13.18060219
  7. ManufacturingProcess37 = 13.00341253
  8. ManufacturingProcess24 = 10.48159891
  9. ManufacturingProcess14 = 10.13834102
  10. ManufacturingProcess45 = 10.03921195

The manufacturing process variables dominate the top 10 here, which is not quite the same picture as in the previous assignment. In that model, ManufacturingProcess40 ended up as the most important predictor, but in this model it doesn’t make the top 10.

c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological process predictors and their relationship with yield?

The randomForest package doesn’t seem to have a native plot for trees. We can select a single tree through the getTree function, but I’m not sure which is the best one.

getTree(cmpRF)
##    left daughter right daughter split var split point status prediction
## 1              2              3        47  483.000000     -3   40.26735
## 2              4              5        36   11.500000     -3   42.05684
## 3              6              7        26 4739.500000     -3   39.96646
## 4              8              9        22    9.200000     -3   43.00500
## 5             10             11        47  474.500000     -3   40.43143
## 6              0              0         0    0.000000     -1   44.16000
## 7             12             13        25   32.550000     -3   39.85209
## 8             14             15         3   66.375000     -3   43.22222
## 9              0              0         0    0.000000     -1   42.35333
## 10             0              0         0    0.000000     -1   42.15500
## 11             0              0         0    0.000000     -1   39.74200
## 12             0              0         0    0.000000     -1   46.34000
## 13            16             17        10    2.335000     -3   39.73194
## 14             0              0         0    0.000000     -1   42.99200
## 15             0              0         0    0.000000     -1   43.51000
## 16            18             19        26 4886.500000     -3   38.44000
## 17            20             21        15    1.505000     -3   39.99033
## 18            22             23         8   16.415000     -3   38.19250
## 19             0              0         0    0.000000     -1   40.42000
## 20             0              0         0    0.000000     -1   37.71200
## 21            24             25        44  163.500000     -3   40.12435
## 22             0              0         0    0.000000     -1   36.47500
## 23            26             27        49    0.650000     -3   38.43786
## 24            28             29        47  508.000000     -3   40.40435
## 25            30             31        23    9.612698     -3   38.91687
## 26             0              0         0    0.000000     -1   37.66500
## 27            32             33        17  990.100000     -3   38.74700
## 28            34             35         1    6.915000     -3   40.66750
## 29            36             37        25   34.250000     -3   38.65000
## 30            38             39        44  164.500000     -3   39.86500
## 31            40             41         4   13.175000     -3   38.34800
## 32             0              0         0    0.000000     -1   38.96750
## 33            42             43         2   52.715000     -3   38.60000
## 34            44             45        49    2.050000     -3   40.95714
## 35            46             47        45   61.071429     -3   39.37727
## 36             0              0         0    0.000000     -1   37.65200
## 37             0              0         0    0.000000     -1   39.89750
## 38             0              0         0    0.000000     -1   38.42000
## 39             0              0         0    0.000000     -1   40.15400
## 40             0              0         0    0.000000     -1   38.82500
## 41            48             49        37 4879.500000     -3   38.03000
## 42             0              0         0    0.000000     -1   38.37000
## 43             0              0         0    0.000000     -1   38.64600
## 44            50             51        37 4868.500000     -3   40.83277
## 45             0              0         0    0.000000     -1   43.88000
## 46             0              0         0    0.000000     -1   43.38000
## 47            52             53        44  158.500000     -3   38.97700
## 48             0              0         0    0.000000     -1   37.94800
## 49             0              0         0    0.000000     -1   38.44000
## 50            54             55         2   48.520000     -3   40.49294
## 51            56             57        38 6080.500000     -3   41.72154
## 52             0              0         0    0.000000     -1   38.23800
## 53             0              0         0    0.000000     -1   39.71600
## 54             0              0         0    0.000000     -1   42.69000
## 55            58             59        54   11.100000     -3   40.42636
## 56            60             61         2   60.320000     -3   42.30000
## 57             0              0         0    0.000000     -1   40.42000
## 58             0              0         0    0.000000     -1   38.98000
## 59            62             63        11  143.795000     -3   40.51968
## 60            64             65        44  158.500000     -3   42.13500
## 61             0              0         0    0.000000     -1   43.62000
## 62            66             67        18  206.950000     -3   39.97111
## 63            68             69         4   12.260000     -3   40.74409
## 64             0              0         0    0.000000     -1   42.19800
## 65             0              0         0    0.000000     -1   42.03000
## 66            70             71        53    0.050000     -3   39.57667
## 67             0              0         0    0.000000     -1   40.76000
## 68            72             73        26 4869.000000     -3   40.96867
## 69            74             75        30 4763.000000     -3   40.26286
## 70             0              0         0    0.000000     -1   40.01000
## 71             0              0         0    0.000000     -1   39.14333
## 72            76             77        29   34.400000     -3   40.83364
## 73             0              0         0    0.000000     -1   41.34000
## 74             0              0         0    0.000000     -1   40.57000
## 75             0              0         0    0.000000     -1   40.14000
## 76             0              0         0    0.000000     -1   40.68000
## 77            78             79        53    0.050000     -3   40.92143
## 78             0              0         0    0.000000     -1   40.87000
## 79             0              0         0    0.000000     -1   40.96000

If we could view the tree, however, it would give a new perspective on the relationship between the biological and manufacturing predictors: we would see how a particular tree (or, in the question’s case, an optimal one) groups the predictors and which predictor sits at the root as the most important split for that tree.
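
One way to actually get such a view is to fit a single rpart tree to the same training data and plot it with the partykit package, which draws the distribution of yield in each terminal node (a sketch, assuming partykit is installed):

library(rpart)
library(partykit)

# Single regression tree on the training data; as.party() converts the
# rpart fit so plot() shows boxplots of yield in the terminal nodes
cmpTree <- rpart(y ~ ., data = data.frame(cmpTrain, y = yTrain))
plot(as.party(cmpTree))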