Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6–V10)?

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.86329776
## V2   6.72851763
## V3   0.84145353
## V4   7.60284159
## V5   2.26864193
## V6   0.11268425
## V7   0.07374772
## V8  -0.07210708
## V9  -0.06913906
## V10 -0.10577619

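No. The random forest model did not make meaningful use of the uninformative predictors (V6–V10). Their importance scores are all within about ±0.11 of zero, well below the scores of the informative predictors V1–V5, which range from 0.84 to 8.86.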
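One way to make "near 0" concrete is to compare the largest score among the noise predictors with the smallest score among the informative ones. The sketch below copies the printed scores (rounded to four decimals) into a vector by hand; the names `imp1`, `informative`, `noise`, `max_noise`, and `min_informative` are illustrative and not part of the original analysis.

```r
# Importance scores copied from the rfImp1 output above (rounded)
imp1 <- c(V1 = 8.8633, V2 = 6.7285, V3 = 0.8415, V4 = 7.6028, V5 = 2.2686,
          V6 = 0.1127, V7 = 0.0737, V8 = -0.0721, V9 = -0.0691, V10 = -0.1058)

informative <- paste0("V", 1:5)
noise       <- paste0("V", 6:10)

# Largest absolute score among the noise predictors
max_noise <- max(abs(imp1[noise]))

# Smallest score among the informative predictors
min_informative <- min(imp1[informative])

round(c(max_noise = max_noise, min_informative = min_informative), 4)
```

Even the weakest informative predictor (V3) scores several times higher than the strongest noise predictor (V6).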

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9356508
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.78865821
## V2          6.48386619
## V3          0.55469974
## V4          6.78484850
## V5          1.96248183
## V6          0.10126938
## V7          0.14210730
## V8         -0.09726812
## V9         -0.08440763
## V10         0.04878300
## duplicate1  4.64551303
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9432429
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                 Overall
## V1          4.642486671
## V2          6.999126963
## V3          0.410061203
## V4          7.088994719
## V5          1.989570698
## V6          0.152581329
## V7         -0.020144331
## V8         -0.077797272
## V9         -0.019773933
## V10         0.004167046
## duplicate1  3.625925699
## duplicate2  2.729716790

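Yes. Adding duplicate1 lowered the importance of V1 from 8.86 to 5.79, and duplicate1 itself scored 4.65. Adding a second highly correlated predictor, duplicate2, lowered the importance of V1 further to 4.64, with duplicate1 and duplicate2 scoring 3.63 and 2.73. The importance that V1 carried on its own in the first model is now divided among the three correlated predictors.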
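The split can be checked with a little arithmetic on the printed scores. The sketch below adds up the importance of V1 and its near-copies in each model and computes the share still credited to V1; the values are copied by hand from the three outputs above (rounded), and `v1_family`, `family_total`, and `v1_share` are illustrative names.

```r
# Scores copied from the rfImp1, rfImp2 and rfImp3 outputs above (rounded)
v1_family <- list(
  model1 = c(V1 = 8.8633),
  model2 = c(V1 = 5.7887, duplicate1 = 4.6455),
  model3 = c(V1 = 4.6425, duplicate1 = 3.6259, duplicate2 = 2.7297)
)

# Total importance across V1 and its near-copies in each model
family_total <- sapply(v1_family, sum)

# Share of that total that is still credited to V1 itself
v1_share <- sapply(v1_family, function(x) x[["V1"]] / sum(x))

round(rbind(family_total, v1_share), 3)
```

The combined score does not fall as duplicates are added, while the share credited to V1 drops each time, which is consistent with the importance being divided rather than lost.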

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
cf_model <- cforest(y ~ ., data = simulated)

cfImp <- varimp(cf_model, conditional = FALSE)
cfImpMod <- varimp(cf_model, conditional = TRUE)

cfImp
##           V1           V2           V3           V4           V5           V6 
##  5.483627440  6.305633822  0.024151673  7.224490398  1.652266288 -0.015582595 
##           V7           V8           V9          V10   duplicate1   duplicate2 
## -0.001313370 -0.005437581  0.026731626 -0.013254646  1.831954644  2.804676212
cfImpMod
##           V1           V2           V3           V4           V5           V6 
##  2.083838071  4.830593732 -0.009487717  5.773446339  1.140880775  0.003183752 
##           V7           V8           V9          V10   duplicate1   duplicate2 
## -0.005794233 -0.013262723 -0.008223196 -0.015798371  0.865926318  0.937012669

The traditional cforest importance (conditional = FALSE) shows the same pattern as the random forest model: V1 scores 5.48, below V2 (6.31) and V4 (7.22), while duplicate1 and duplicate2 pick up 1.83 and 2.80.

The modified (conditional) importance shows a different pattern. V1, V2, and V4 all score lower (2.08, 4.83, and 5.77, respectively) than the 4.64, 7.00, and 7.09 from the third random forest model, and duplicate1 and duplicate2 both fall below 1 (0.87 and 0.94). The traditional importance measure, both in random forests and in cforest without conditioning, distributes importance across correlated predictors, and the conditional measure reduces this bias. As a result, the scores for V1 and its duplicates are lower and reflect their unique contribution more closely, rather than importance shared with correlated variables. V1 still ranks third, behind V4 and V2. The conditional importance therefore does not follow the same pattern as the traditional random forest model, and it gives a more reliable assessment when correlated predictors are present.
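To see which predictors lose the most under conditioning, the printed scores can be compared as ratios. This is a minimal sketch using values copied by hand from the `cfImp` and `cfImpMod` outputs above (rounded); predictors with scores near zero are left out because their ratios are not meaningful, and `uncond`, `cond`, and `retained` are illustrative names.

```r
# Scores copied from the cfImp and cfImpMod outputs above (rounded)
uncond <- c(V1 = 5.4836, V2 = 6.3056, V4 = 7.2245, V5 = 1.6523,
            duplicate1 = 1.8320, duplicate2 = 2.8047)
cond   <- c(V1 = 2.0838, V2 = 4.8306, V4 = 5.7734, V5 = 1.1409,
            duplicate1 = 0.8659, duplicate2 = 0.9370)

# Fraction of the unconditional score that survives conditioning
retained <- cond / uncond
round(sort(retained), 2)
```

V1 and its two duplicates each keep less than half of their unconditional score, while V2, V4, and V5 keep roughly 70 to 80 percent, so the reduction is concentrated on the correlated group.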

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
gbmTune <- train(y ~ ., data = simulated,
                 method = "gbm",
                 verbose = FALSE)

varImp(gbmTune)
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          72.9037
## V1          36.3824
## V5          36.0867
## V3          26.2417
## duplicate2  25.7814
## duplicate1  25.6807
## V7           2.4011
## V6           1.8590
## V10          1.1521
## V8           0.7665
## V9           0.0000
cubistTune <- train(y ~ ., data = simulated,
                    method = "cubist")

varImp(cubistTune)
## cubist variable importance
## 
##            Overall
## V1         100.000
## V2          85.507
## V4          71.014
## V3          58.696
## V5          49.275
## V6          11.594
## duplicate2   7.971
## V9           0.000
## V10          0.000
## duplicate1   0.000
## V8           0.000
## V7           0.000

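The boosted tree model assigns the greatest importance to V4, followed by V2, V1, V5, and V3, while the Cubist model assigns the greatest importance to V1, followed by V2, V4, V3, and V5. The boosted trees show a pattern similar to the random forests: V1 scores only 36.4 on the 0 to 100 scale, and duplicate1 and duplicate2 each score about 25.7, so the importance is shared among the three correlated predictors. Cubist does not show this pattern. V1 remains its most important predictor at 100, while duplicate1 scores 0 and duplicate2 scores about 8. In these fits, boosted trees are affected by the correlated predictors in much the same way as random forests, which makes their importance scores harder to interpret, whereas Cubist is affected far less. Neither model was refit without the duplicates, so the size of any change in the importance of V1 cannot be read directly from these outputs.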
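A quick way to compare how much credit each model gives the near-copies relative to V1 is to divide their scores by the score of V1 within the same model. The sketch below uses the scaled scores copied by hand from the two `varImp()` outputs above; `gbm_imp`, `cubist_imp`, and `rel_to_v1` are illustrative names, not part of the original analysis.

```r
# Scaled (0-100) scores copied from the varImp() outputs above
gbm_imp    <- c(V1 = 36.3824, duplicate1 = 25.6807, duplicate2 = 25.7814)
cubist_imp <- c(V1 = 100.000, duplicate1 = 0.000,   duplicate2 = 7.971)

# Importance of each near-copy relative to V1 within the same model
rel_to_v1 <- function(imp) imp[c("duplicate1", "duplicate2")] / imp[["V1"]]

round(rbind(gbm = rel_to_v1(gbm_imp), cubist = rel_to_v1(cubist_imp)), 3)
```

Because each model's scores are scaled separately, the ratios are only comparable within a row, not as absolute importance across models.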