library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
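For reference, `mlbench.friedman1` draws ten independent uniform inputs but builds the response from only the first five, so V6 through V10 are uninformative by construction. Below is a minimal base-R sketch of the generating formula as documented for mlbench. The helper name `friedman1_signal` and the seed are made up for illustration, and the sketch does not reproduce the exact random draws of the call above.

```r
# Hypothetical helper: noise-free Friedman 1 signal, which uses columns 1 to 5 only
friedman1_signal <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
}

set.seed(1)                                   # arbitrary seed for this sketch
x <- matrix(runif(200 * 10), ncol = 10)       # ten independent U(0, 1) inputs
y <- friedman1_signal(x) + rnorm(200, sd = 1) # Gaussian noise with sd = 1, as above

# Scrambling columns 6 to 10 leaves the signal unchanged
x_scrambled <- x
x_scrambled[, 6:10] <- runif(200 * 5)
signal_unchanged <- identical(friedman1_signal(x), friedman1_signal(x_scrambled))
signal_unchanged
```

Because the signal never reads columns 6 to 10, any importance a model assigns to them reflects noise rather than structure in the data.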
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
# Fit a random forest on all ten predictors and keep importance scores
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.86329776
## V2 6.72851763
## V3 0.84145353
## V4 7.60284159
## V5 2.26864193
## V6 0.11268425
## V7 0.07374772
## V8 -0.07210708
## V9 -0.06913906
## V10 -0.10577619
No. The random forest model made essentially no use of the uninformative predictors (V6–V10): their importance values all lie close to zero (between about -0.11 and 0.11), well below those of the informative predictors V1–V5.
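The cforest, boosted tree, and Cubist results further down include two extra columns, duplicate1 and duplicate2, but the step that creates them is not shown. The following is a minimal sketch of one way to build such predictors, assuming they are noisy copies of V1. The noise scale of 0.1 and the stand-in vector `v1` are assumptions, not something recoverable from the output.

```r
set.seed(200)
# Stand-in for simulated$V1: Friedman 1 inputs are uniform on [0, 1]
v1 <- runif(200)

# Noisy copies of V1; the 0.1 noise scale is an assumption
duplicate1 <- v1 + rnorm(200) * 0.1
duplicate2 <- v1 + rnorm(200) * 0.1

# Pairwise correlations: the copies should track V1 closely
dup_cor <- cor(cbind(v1, duplicate1, duplicate2))
round(dup_cor, 2)
```

Applied to the data frame above, the same idea would read `simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1`, and likewise for duplicate2.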
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
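# Conditional inference forest on the data with the duplicated predictors
cf_model <- cforest(y ~ ., data = simulated)
# Unconditional (traditional permutation) importance
cfImp <- varimp(cf_model, conditional = FALSE)
# Conditional importance, which adjusts for correlation among predictors
cfImpMod <- varimp(cf_model, conditional = TRUE)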
cfImp
##           V1           V2           V3           V4           V5           V6
##  5.483627440  6.305633822  0.024151673  7.224490398  1.652266288 -0.015582595
##           V7           V8           V9          V10   duplicate1   duplicate2
## -0.001313370 -0.005437581  0.026731626 -0.013254646  1.831954644  2.804676212
cfImpMod
##           V1           V2           V3           V4           V5           V6
##  2.083838071  4.830593732 -0.009487717  5.773446339  1.140880775  0.003183752
##           V7           V8           V9          V10   duplicate1   duplicate2
## -0.005794233 -0.013262723 -0.008223196 -0.015798371  0.865926318  0.937012669
The modified (conditional) importance does not show the same pattern as the traditional random forest measure. In the output above, V1, V2, and V4 receive noticeably lower values under conditioning (about 2.08, 4.83, and 5.77, respectively) than under the unconditional measure (5.48, 6.31, and 7.22), and the two duplicates drop from 1.83 and 2.80 to 0.87 and 0.94. The traditional importance measure (both in random forests and in cforest without conditioning) distributes importance across correlated predictors, whereas the conditional measure reduces this bias. As a result, the importance values for variables like V1 are lower and more closely reflect their unique contribution rather than importance shared with correlated variables. The conditional importance therefore provides a more reliable assessment when correlated predictors are present.
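To make the comparison concrete, the short sketch below computes how much of each predictor's unconditional importance survives conditioning. The values are copied (rounded to four decimals) from the `cfImp` and `cfImpMod` output above. The vector names are illustrative, and the near-zero predictors are left out because their ratios are not meaningful.

```r
# Importance values copied (rounded) from the cfImp and cfImpMod output above
unconditional <- c(V1 = 5.4836, V2 = 6.3056, V4 = 7.2245, V5 = 1.6523,
                   duplicate1 = 1.8320, duplicate2 = 2.8047)
conditional   <- c(V1 = 2.0838, V2 = 4.8306, V4 = 5.7734, V5 = 1.1409,
                   duplicate1 = 0.8659, duplicate2 = 0.9370)

# Fraction of the unconditional importance that survives conditioning
retained <- conditional / unconditional
round(sort(retained), 2)
```

In this run, V1 and its two copies keep less than half of their unconditional importance, whereas V2, V4, and V5 keep roughly 70 to 80 percent. Because no seed is set before `cforest`, the exact figures may differ between runs.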
library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
gbmTune <- train(y ~ ., data = simulated,
                 method = "gbm",
                 verbose = FALSE)
varImp(gbmTune)
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 72.9037
## V1 36.3824
## V5 36.0867
## V3 26.2417
## duplicate2 25.7814
## duplicate1 25.6807
## V7 2.4011
## V6 1.8590
## V10 1.1521
## V8 0.7665
## V9 0.0000
cubistTune <- train(y ~ ., data = simulated,
                    method = "cubist")
varImp(cubistTune)
## cubist variable importance
##
## Overall
## V1 100.000
## V2 85.507
## V4 71.014
## V3 58.696
## V5 49.275
## V6 11.594
## duplicate2 7.971
## V9 0.000
## V10 0.000
## duplicate1 0.000
## V8 0.000
## V7 0.000
The boosted tree model assigns the greatest importance to V4, followed by V2, V1, V5, and V3, while the Cubist model ranks V1 first, followed by V2, V4, V3, and V5. The boosted trees show a pattern similar to random forests: with the correlated duplicates present, V1 falls to third place (36.4 on the scaled 0 to 100 range), and duplicate1 and duplicate2 each pick up a comparable share (about 25.7 and 25.8). Cubist appears less affected in this run: V1 remains the top predictor and the duplicates receive little importance (8.0 for duplicate2 and 0 for duplicate1). Multicollinearity can therefore redistribute importance among correlated predictors and make interpretation less reliable, most visibly here for the boosted trees.
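One way to quantify how much importance the duplicates draw away from V1 is to compute their share within the V1 group for each model. The sketch below uses the scaled importances copied from the two `varImp()` outputs above. The helper `copy_share` is hypothetical, and since caret rescales each model's importances to a 0 to 100 range, only within-model shares are compared.

```r
# Scaled importances copied from the varImp() output above
gbm_imp <- c(V4 = 100, V2 = 72.9037, V1 = 36.3824, V5 = 36.0867, V3 = 26.2417,
             duplicate2 = 25.7814, duplicate1 = 25.6807, V7 = 2.4011,
             V6 = 1.8590, V10 = 1.1521, V8 = 0.7665, V9 = 0)
cubist_imp <- c(V1 = 100, V2 = 85.507, V4 = 71.014, V3 = 58.696, V5 = 49.275,
                V6 = 11.594, duplicate2 = 7.971, V9 = 0, V10 = 0,
                duplicate1 = 0, V8 = 0, V7 = 0)

v1_group <- c("V1", "duplicate1", "duplicate2")

# Hypothetical helper: fraction of the V1 group's importance held by the two copies
copy_share <- function(imp) {
  sum(imp[c("duplicate1", "duplicate2")]) / sum(imp[v1_group])
}

shares <- c(gbm = copy_share(gbm_imp), cubist = copy_share(cubist_imp))
round(shares, 2)
```

In this particular output, the duplicates hold more than half of the V1 group's importance under boosting but under a tenth of it under Cubist.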