DATA 624 Homework 9

Author

Henock Montcho

Published

May 18, 2025

1-) Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Did the random forest model significantly use the uninformative predictors (V6–V10)?

library(randomForest)
library(caret) 
model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
         Overall
V1   8.732235404
V2   6.415369387
V3   0.763591825
V4   7.615118809
V5   2.023524577
V6   0.165111172
V7  -0.005961659
V8  -0.166362581
V9  -0.095292651
V10 -0.074944788

Comment: The random forest model did not significantly use the uninformative predictors; the importance scores for V6–V10 are all near zero, while the informative predictors carry much larger scores.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 
cor(simulated$duplicate1, simulated$V1)
[1] 0.9418838

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2
               Overall
V1          6.02363848
V2          6.19154188
V3          0.55277883
V4          6.92793183
V5          2.17101110
V6          0.15369922
V7          0.10720626
V8          0.00929209
V9         -0.05010858
V10         0.03861636
duplicate1  3.82307834

Comment: Yes. The importance score for V1 dropped from about 8.73 to 6.02, because the highly correlated duplicate1 predictor now shares part of the importance that previously went to V1 alone.

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9430605
model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3
                Overall
V1          4.750274828
V2          6.392645096
V3          0.546932231
V4          6.694197135
V5          2.354901393
V6          0.178559997
V7          0.003137176
V8         -0.067194296
V9         -0.088150851
V10        -0.040809537
duplicate1  2.718782851
duplicate2  2.914196065

Comment: After a second predictor highly correlated with V1 was added, the importance scores for the other variables remained approximately the same (minor changes), while the importance score for V1 decreased further (from 8.73 to 6.02, then to 4.75) as its importance was split among the correlated predictors.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
          V1           V2           V3           V4           V5           V6 
 5.278213381  5.357164042  0.013611736  6.450219543  1.208606076  0.032344932 
          V7           V8           V9          V10 
 0.027047409 -0.016253565  0.007697236  0.009512732 

Comment: The importance scores show a broadly similar pattern to the traditional random forest model: the informative predictors dominate and the uninformative ones (V6–V10) stay near zero. The ordering at the top shifts, however. The traditional model ranked V1 first, followed by V4, whereas the conditional cforest importances rank V4 first, with V2 and V1 close behind.
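Note: For a side-by-side look at the two measures, the traditional (unconditional) permutation importance can also be pulled from the same cforest fit. A minimal sketch, assuming the model4 object fit above:

# unconditional (traditional) permutation importance from the same fit;
# comparing it with rfImp4 shows how much the conditional adjustment changes the scores
varimp(model4, conditional = FALSE)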

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)

rfImp5
      Overall
V1  4634.0234
V2  4316.6044
V3  1310.7178
V4  4287.0793
V5  1844.2147
V6   416.3966
V7   368.9383
V8   224.4769
V9   246.1400
V10  250.6263

Comment: Although the importance scores are reported on a different scale, the pattern remains largely consistent with the traditional random forest model: V1 is again the most important variable, and the uninformative predictors (V6–V10) contribute relatively little.

library(Cubist)
set.seed(1001)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)

rfImp6
    Overall
V1     72.0
V3     42.0
V2     54.5
V4     49.0
V5     40.0
V6     11.0
V7      0.0
V8      0.0
V9      0.0
V10     0.0

Comment: Cubist shows a similar pattern to the other models, with V1 leading and the uninformative predictors V7–V10 receiving zero importance.

2-) Use a simulation to show tree bias with different granularities.

library(caret)
library(partykit)
library(party)
library(ggplot2)
library(rpart)
library(ipred)

set.seed(002)

# five predictors on [0, 1] with increasing granularity:
# 10, 100, 1,000, 10,000, and 100,000 possible distinct values
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)

# response depends equally on all five predictors
y <- a + b + c + d + e

simData <- data.frame(a,b,c,d,e,y) 

rpartTree <- rpart(y ~ ., data = simData)

plot(as.party(rpartTree), gp = gpar(fontsize = 7))

varImp(rpartTree)
   Overall
a 4.083436
b 3.513219
c 2.599040
d 3.727643
e 3.471319

Comment: When a predictor has more distinct values, a decision tree has more candidate split points for it and is therefore more likely to select it, even when it is no more informative than the other predictors. This bias can lead the model to favor fine-grained or noisy predictors over coarser but equally informative ones, especially near the top of the tree.

In this simulation all five predictors contribute equally to y and differ only in their granularity, so any systematic preference in split selection or importance reflects this bias rather than real differences in signal. A single fit is noisy, however, so the bias is easier to see across repeated simulations, as sketched below.
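Note: A minimal sketch that repeats the simulation 100 times (the seed and repetition count are illustrative choices) and tallies which predictor rpart selects for the first split:

library(rpart)

set.seed(123)  # illustrative seed
rootVar <- replicate(100, {
  # regenerate the five predictors with increasing granularity
  a <- sample(1:10 / 10, 500, replace = TRUE)
  b <- sample(1:100 / 100, 500, replace = TRUE)
  c <- sample(1:1000 / 1000, 500, replace = TRUE)
  d <- sample(1:10000 / 10000, 500, replace = TRUE)
  e <- sample(1:100000 / 100000, 500, replace = TRUE)
  y <- a + b + c + d + e
  fit <- rpart(y ~ ., data = data.frame(a, b, c, d, e, y))
  as.character(fit$frame$var[1])  # variable used at the root node
})
table(rootVar)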

3-) In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Comment: The bagging fraction is the proportion of the training data randomly sampled to build each tree, while the learning rate scales how much each new tree's predictions are added to the ensemble, essentially determining how quickly the model learns.

A lower learning rate is generally preferred because it lets the model learn gradually over many iterations, which often leads to better performance. The model on the left, with its small learning rate and small bagging fraction, learns slowly and requires more computation, but each tree sees a different small subset of the data. That sampling diversity means successive trees split on a wider variety of predictors, which spreads the importance across more of them.

In contrast, the model on the right uses nearly all of the training data for every tree and takes large steps, so the first few trees capture most of the structure and later trees keep splitting on the same dominant predictors. This concentrates importance on just a few variables, skews the importance profile, and makes overfitting more likely.
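Note: A minimal sketch of how these two extremes could be reproduced on the solubility data from the AppliedPredictiveModeling package (illustrative settings only; the exact tuning behind Fig. 8.24 may differ):

library(AppliedPredictiveModeling)
library(gbm)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
# learning rate and bagging fraction both 0.1 (left-hand plot)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# learning rate and bagging fraction both 0.9 (right-hand plot)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# relative-influence profiles: the 0.9/0.9 fit should concentrate importance
# on far fewer predictors than the 0.1/0.1 fit
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)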

(b) Which model do you think would be more predictive of other samples?

Comment: The model on the left is more likely to be predictive of other samples. Its smaller learning rate and bagging fraction act like regularization: importance is spread across many predictors rather than concentrated in a few, so the model depends less on any single predictor and its more gradual learning tends to generalize better to unseen data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth is the maximum depth of each tree, i.e., the number of splits performed per tree. Increasing it lets each tree use more predictors and capture more complex interactions between features, so relative influence is spread over a larger set of predictors and the less dominant ones contribute more. For either model in Fig. 8.24, this would make the slope of the predictor-importance profile flatter (less steep) when visualized. A sketch of this effect is shown below.
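Note: A minimal sketch of the interaction-depth effect, under the same assumptions as the previous sketch (solubility data from AppliedPredictiveModeling, illustrative settings):

library(AppliedPredictiveModeling)
library(gbm)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
# depth 1: each tree is a stump, so relatively few predictors can enter the model
gbmShallow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                  n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                  interaction.depth = 1)
# depth 7: each tree can split on many predictors, spreading relative influence
gbmDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 7)

# number of predictors receiving any relative influence in each fit
sum(summary(gbmShallow, plotit = FALSE)$rel.inf > 0)
sum(summary(gbmDeep, plotit = FALSE)$rel.inf > 0)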

4-) Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling) 
data(ChemicalManufacturingProcess)

set.seed(10001)

# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(00009)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

set.seed(1001)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))

cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)
     RMSE  Rsquared       MAE 
1.4697466 0.4223485 1.1158856 
set.seed(10010)

baggedTree <- ipredbagg(train_y, train_x)
 
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
     RMSE  Rsquared       MAE 
1.1662654 0.6252609 0.8500063 
library(randomForest)
set.seed(2100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
     RMSE  Rsquared       MAE 
1.0866922 0.6708165 0.8106306 
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
     RMSE  Rsquared       MAE 
1.0344506 0.7005943 0.7667668 
set.seed(120)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
     RMSE  Rsquared       MAE 
0.9468747 0.7625873 0.6869154 
rbind(cart = postResample(cartPred, test_y),
      bagged = postResample(baggedPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
                  RMSE  Rsquared       MAE
cart         1.4697466 0.4223485 1.1158856
bagged       1.1662654 0.6252609 0.8500063
randomForest 1.0866922 0.6708165 0.8106306
boosted      1.0344506 0.7005943 0.7667668
cubist       0.9468747 0.7625873 0.6869154

Comment: The Cubist model has the lowest RMSE and MAE and the highest R2 on the test set, so it gives the optimal performance among the tree-based models; a quick check of the resampling results for the train() fits is sketched below.
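Note: caret's getTrainPerf() reports each train() object's resampled performance at its selected tuning parameters (the bagged and random forest fits were built outside of train() and are omitted here). A minimal sketch:

rbind(cart = getTrainPerf(cartTune),
      boosted = getTrainPerf(gbmTune),
      cubist = getTrainPerf(cubistTuned))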

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?

varImp(cubistTuned, scale = TRUE)
cubist variable importance

  only 20 most important variables shown (out of 56)

                       Overall
ManufacturingProcess32  100.00
BiologicalMaterial06     83.12
ManufacturingProcess13   77.92
BiologicalMaterial02     55.84
BiologicalMaterial12     46.75
ManufacturingProcess39   42.86
ManufacturingProcess33   42.86
ManufacturingProcess17   41.56
ManufacturingProcess04   40.26
ManufacturingProcess09   38.96
ManufacturingProcess28   36.36
ManufacturingProcess29   28.57
BiologicalMaterial03     27.27
BiologicalMaterial08     25.97
BiologicalMaterial09     25.97
ManufacturingProcess27   20.78
BiologicalMaterial11     19.48
ManufacturingProcess10   16.88
ManufacturingProcess37   15.58
ManufacturingProcess45   14.29
plot(varImp(cubistTuned), top = 20) 

Comment: The manufacturing process variables dominate the list with a 13:7 ratio, while the optimal linear and nonlinear models show a relatively more balanced distribution, with ratios of 11:9.
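Note: The 13:7 split can be checked directly from the importance object; a minimal sketch:

# count ManufacturingProcess vs. BiologicalMaterial predictors in the top 20
imp <- varImp(cubistTuned)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))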

plot(varImp(cubistTuned), top = 10,
     main = "Tree: Top 10 Important Predictors")

Note: Linear and Nonlinear plots pasted from previous homework.

Comment: The top 10 important predictors from the Cubist model have seven (7) predictors in common with the optimal linear and nonlinear models. The Cubist model includes only three (3) biological predictors in its top 10, whereas the optimal linear and nonlinear models each include four (4). ManufacturingProcess32 remains the most important predictor in all three models.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)

rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])

plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))

tree_model <- rpart(Yield ~ ., data = Chemical[index, ], method = "anova", control = rpart.control(cp = 0.01))

rpart.plot(tree_model, type = 4, extra = 101, fallen.leaves = TRUE)

Comment: Yes, this view of the data provides additional insight into the biological and process predictors and their relationship with yield. Manufacturing process predictors appear early in the tree and more often, suggesting they have the strongest influence on yield, and the terminal-node distributions show how yield shifts across those splits.