DATA 624 Homework 9
1-) Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
        Overall
V1 8.732235404
V2 6.415369387
V3 0.763591825
V4 7.615118809
V5 2.023524577
V6 0.165111172
V7 -0.005961659
V8 -0.166362581
V9 -0.095292651
V10 -0.074944788
Comment: The random forest model did not significantly use the uninformative predictors (V6 – V10); their importance scores are all near zero or negative.
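As a quick visual check (a minimal sketch using the model1 object fit above), randomForest's built-in importance plot makes the gap between the informative and uninformative predictors easy to see:
# Permutation importance (%IncMSE) for model1; V6-V10 should sit near zero.
varImpPlot(model1, type = 1, main = "Permutation importance, model1")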
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9418838
Fit another random forest model to these data. Did the importance score for V1 change?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
        Overall
V1 6.02363848
V2 6.19154188
V3 0.55277883
V4 6.92793183
V5 2.17101110
V6 0.15369922
V7 0.10720626
V8 0.00929209
V9 -0.05010858
V10 0.03861636
duplicate1 3.82307834
Comment: Yes, the importance score of V1 changed; it dropped from 8.73 to 6.02, with the highly correlated duplicate1 absorbing part of the importance (3.82).
What happens when you add another predictor that is also highly correlated with V1?
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9430605
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
        Overall
V1 4.750274828
V2 6.392645096
V3 0.546932231
V4 6.694197135
V5 2.354901393
V6 0.178559997
V7 0.003137176
V8 -0.067194296
V9 -0.088150851
V10 -0.040809537
duplicate1 2.718782851
duplicate2 2.914196065
Comment: After a second predictor highly correlated with V1 was added, the importance scores of the other variables remained approximately the same (minor changes), while the importance score of V1 decreased further (from 8.73 to 6.02, and now 4.75), with the two duplicates splitting the importance that V1 alone used to carry.
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
          V1           V2           V3           V4           V5           V6
 5.278213381  5.357164042  0.013611736  6.450219543  1.208606076  0.032344932
          V7           V8           V9          V10
 0.027047409 -0.016253565  0.007697236  0.009512732
Comment: The conditional importance scores show a broadly similar pattern to the traditional random forest model: the informative predictors (V1 – V5) dominate and the uninformative ones (V6 – V10) are essentially zero. The main difference is at the top of the ranking: the traditional model placed V1 first, followed by V4, whereas the cforest model ranks V4 first, with V2 and V1 close behind.
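For reference, a minimal sketch of the traditional (unconditional) permutation importance from the same cforest fit, so the two measures can be compared side by side:
# conditional = TRUE above adjusts for correlations among predictors;
# conditional = FALSE is the traditional measure.
rfImp4_uncond <- varimp(model4, conditional = FALSE)
sort(rfImp4_uncond, decreasing = TRUE)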
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
- Boosted trees
library(gbm)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)
rfImp5
        Overall
V1 4634.0234
V2 4316.6044
V3 1310.7178
V4 4287.0793
V5 1844.2147
V6 416.3966
V7 368.9383
V8 224.4769
V9 246.1400
V10 250.6263
Comment: Although the importance scores are higher overall, the pattern remains largely consistent with the traditional random forest model. V1 continues to be the most important variable.
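The raw values above are gbm's relative-influence totals taken from the final model; a quick way to put them on a comparable 0-100 scale is to call varImp() on the train object itself (sketch):
# Scaled (0-100) importance from the tuned gbm model, easier to compare
# against the other models' importance tables.
varImp(gbmTune)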
- Cubist
library(Cubist)
set.seed(1001)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)
rfImp6
        Overall
V1 72.0
V3 42.0
V2 54.5
V4 49.0
V5 40.0
V6 11.0
V7 0.0
V8 0.0
V9 0.0
V10 0.0
Comment: Cubist shows a similar pattern to the other models, with V1 the most important predictor and the uninformative predictors V7 – V10 receiving zero importance.
2-) Use a simulation to show tree bias with different granularities.
library(caret)
library(partykit)
library(party)
library(ggplot2)
library(rpart)
library(ipred)
set.seed(002)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))
varImp(rpartTree)
   Overall
a 4.083436
b 3.513219
c 2.599040
d 3.727643
e 3.471319
Comment: When a variable has a larger number of distinct values, decision trees are more likely to select it over others, even if it’s not informative. This increases the likelihood that the model will prioritize noisy variables over truly informative ones, especially near the top of the tree.
In this simulation, the tree-based model tended to assign higher importance to variables with more unique values. It also frequently chose noisy or highly repetitive variables as the root node, potentially leading to misleading interpretations.
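To isolate the granularity effect, here is a small illustrative sketch (the names x1, x2, and y2 are made up for this example): the response depends only on a coarse two-level predictor, yet the fine-grained pure-noise predictor may still earn a nontrivial importance score because it offers many more candidate split points.
set.seed(123)
x1 <- sample(1:2, 500, replace = TRUE)   # informative, only 2 distinct values
x2 <- rnorm(500)                         # pure noise, ~500 distinct values
y2 <- x1 + rnorm(500, sd = 1.5)
biasData <- data.frame(x1, x2, y2)
biasTree <- rpart(y2 ~ x1 + x2, data = biasData)
varImp(biasTree)   # check whether x2 picks up importance despite being noise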
3-) In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Comment: The bagging fraction refers to the proportion of the training data used in each iteration, while the learning rate controls how much of the current prediction is added to the previous one—essentially determining how quickly the model learns.
A lower learning rate is generally preferred because it allows the model to learn more gradually over a greater number of iterations, often leading to better performance. The model on the left, which uses a smaller learning rate and bagging fraction, learns more slowly and requires more computation, but this often results in improved accuracy. It also relies on a smaller subset of the data at each step.
In contrast, the model on the right likely suffers from overfitting due to its higher learning rate and bagging fraction. Because each iteration uses nearly all of the training data and each tree's contribution is added with a large weight, the same few strong predictors are selected again and again. This concentrates importance in the first few predictors, skewing the importance profile and reducing generalization.
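A sketch of how one might reproduce the Fig. 8.24 comparison on the solubility data (this assumes the solubility data from AppliedPredictiveModeling; the number of trees is illustrative, not the book's exact setting):
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solData <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLow,  plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbmHigh, plotit = FALSE), 10)   # importance concentrated in a few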
(b) Which model do you think would be more predictive of other samples?
Comment: The model on the left is likely to be more predictive of other samples. Its smaller learning rate and bagging fraction act as a form of regularization: no single predictor dominates, and the model learns gradually over many iterations, which tends to improve accuracy on unseen data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Interaction depth is the number of splits performed in each tree, i.e., the maximum depth (or nodes) per tree. Increasing it allows the model to capture more complex interactions between features, so more of the lesser predictors are used and contribute importance. For either model in Fig. 8.24, this would spread importance across more predictors and therefore flatten the slope of the predictor importance profile, which is especially noticeable when visualized.
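Continuing the sketch above (same solData object), increasing interaction.depth lets each tree use more predictors per iteration:
set.seed(100)
gbmDeep <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 7)
head(summary(gbmDeep, plotit = FALSE), 10)   # compare the drop-off with gbmLow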
4-) Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
set.seed(10001)
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(00009)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
(a) Which tree-based regression model gives the optimal resampling and test set performance?
- Single Trees
set.seed(1001)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
     RMSE  Rsquared       MAE
1.4697466 0.4223485 1.1158856
- Bagged Trees
set.seed(10010)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
     RMSE  Rsquared       MAE
1.1662654 0.6252609 0.8500063
- Random Forest
library(randomForest)
set.seed(2100)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
     RMSE  Rsquared       MAE
1.0866922 0.6708165 0.8106306
- Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
     RMSE  Rsquared       MAE
1.0344506 0.7005943 0.7667668
- Cubist
set.seed(120)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
     RMSE  Rsquared       MAE
0.9468747 0.7625873 0.6869154
rbind(cart = postResample(cartPred, test_y),
bagged = postResample(baggedPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
                  RMSE  Rsquared       MAE
cart         1.4697466 0.4223485 1.1158856
bagged       1.1662654 0.6252609 0.8500063
randomForest 1.0866922 0.6708165 0.8106306
boosted      1.0344506 0.7005943 0.7667668
cubist       0.9468747 0.7625873 0.6869154
Comment: The Cubist model has the lowest RMSE and MAE, and the highest R2 . It gives the optimal resampling and test set performance.
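To complement the test set results, the resampled performance of the caret-trained models can be pulled with getTrainPerf() (a sketch; the bagged trees and the random forest were fit outside train(), so they have no stored resampling results here):
# Resampled RMSE / R^2 / MAE at each model's chosen tuning parameters.
rbind(cart    = getTrainPerf(cartTune),
      boosted = getTrainPerf(gbmTune),
      cubist  = getTrainPerf(cubistTuned))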
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?
varImp(cubistTuned, scale = TRUE)
cubist variable importance

  only 20 most important variables shown (out of 56)
Overall
ManufacturingProcess32 100.00
BiologicalMaterial06 83.12
ManufacturingProcess13 77.92
BiologicalMaterial02 55.84
BiologicalMaterial12 46.75
ManufacturingProcess39 42.86
ManufacturingProcess33 42.86
ManufacturingProcess17 41.56
ManufacturingProcess04 40.26
ManufacturingProcess09 38.96
ManufacturingProcess28 36.36
ManufacturingProcess29 28.57
BiologicalMaterial03 27.27
BiologicalMaterial08 25.97
BiologicalMaterial09 25.97
ManufacturingProcess27 20.78
BiologicalMaterial11 19.48
ManufacturingProcess10 16.88
ManufacturingProcess37 15.58
ManufacturingProcess45 14.29
plot(varImp(cubistTuned), top = 20)
Comment: The manufacturing process variables dominate the top 20 list with a 13:7 ratio, whereas the optimal linear and nonlinear models showed a relatively more balanced distribution, with ratios of 11:9.
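A quick tally of predictor types in the top 20 confirms the 13:7 split (sketch using the varImp table above):
imp   <- varImp(cubistTuned)$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
table(ifelse(grepl("^Biological", top20), "Biological", "Process"))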
- How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cubistTuned), top = 10,
main = "Tree: Top 10 Important Predictors")
Note: Linear and nonlinear plots pasted from previous homework.
Comment: The top 10 important predictors from the Cubist model share seven (7) predictors with the optimal linear and nonlinear models. The Cubist model shows only three (3) biological predictors in its top 10, whereas the optimal linear and nonlinear models show four (4). ManufacturingProcess32 remains the most important predictor in all three models.
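If the tuned models from the earlier homework are still available in the workspace, the overlap can be checked programmatically; the object names plsTune and svmRTuned below are assumed placeholders for the optimal linear and nonlinear fits.
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
intersect(top10(cubistTuned), top10(plsTune))     # shared with the linear model
intersect(top10(cubistTuned), top10(svmRTuned))   # shared with the nonlinear model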
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))
tree_model <- rpart(Yield ~ ., data = Chemical[index, ], method = "anova", control = rpart.control(cp = 0.01))
rpart.plot(tree_model, type = 4, extra = 101, fallen.leaves = TRUE)
Comment: Yes, this view of the data provides additional knowledge about the biological and process predictors and their relationship with yield. Manufacturing process predictors appear early in the tree (and more often), suggesting they have the strongest impact on yield.
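As a numeric companion to the plot (sketch), the yield distribution within each terminal node of tree_model can be summarized directly from the fitted object:
trainChem      <- Chemical[index, ]
trainChem$node <- factor(tree_model$where)          # terminal node for each training row
tapply(trainChem$Yield, trainChem$node, summary)    # yield distribution per terminal node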