library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1.2
library(caret)
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# Variable importance
importance(model1)
## %IncMSE IncNodePurity
## V1 54.37622618 1102.1570
## V2 47.44623992 917.1714
## V3 10.94311207 296.8864
## V4 54.36513898 1049.3256
## V5 23.22266289 501.4776
## V6 2.54373864 185.6101
## V7 0.99827043 192.6734
## V8 -1.35876789 142.2531
## V9 0.06585239 149.0796
## V10 -0.99865637 181.5005
varImp(model1)
## Overall
## V1 54.37622618
## V2 47.44623992
## V3 10.94311207
## V4 54.36513898
## V5 23.22266289
## V6 2.54373864
## V7 0.99827043
## V8 -1.35876789
## V9 0.06585239
## V10 -0.99865637
The random forest results show that V1–V5 have the highest importance values (%IncMSE), led by V1 (54.38) and V4 (54.37), followed by V2 (47.45), V5 (23.22), and V3 (10.94). In contrast, V6–V10 have very low or negative importance, such as V6 (2.54), V8 (-1.36), and V10 (-1.00), which indicates that these variables do not contribute meaningfully to the model. The random forest therefore identifies the informative predictors (V1–V5) and gives essentially no weight to the noise variables (V6–V10).
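The split between V1–V5 and V6–V10 follows from how the data are generated. The sketch below writes out the Friedman 1 mean function as described in the mlbench documentation, using base R only; the helper name `friedman1_mean` and the seed are illustrative and not part of the original analysis. Only the first five columns enter the formula, and the V3 term, 20 (x3 - 0.5)^2, ranges over just 0 to 5, which may help explain why V3 scores lower than the other informative predictors.

```r
# Friedman 1 benchmark mean function (as documented for mlbench.friedman1):
#   y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + noise
# friedman1_mean is a hypothetical helper written for this illustration.
friedman1_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
}

set.seed(1)                               # illustrative seed
x <- matrix(runif(200 * 10), ncol = 10)   # ten uniform(0, 1) predictors
y_mean <- friedman1_mean(x)

# Replacing columns 6-10 with fresh random values leaves the signal unchanged,
# because those columns never enter the formula.
x_scrambled <- x
x_scrambled[, 6:10] <- runif(200 * 5)
identical(y_mean, friedman1_mean(x_scrambled))   # TRUE
```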
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
# Check correlation
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
# Fit new model
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# Variable importance
importance(model2)
## %IncMSE IncNodePurity
## V1 30.5311888 741.9267
## V2 46.3123499 857.7773
## V3 9.9366932 259.1364
## V4 50.1272320 927.4219
## V5 25.2178417 448.9739
## V6 2.2367260 178.8957
## V7 1.1812814 178.2118
## V8 -0.9196298 132.8826
## V9 1.2666279 144.3073
## V10 1.9447746 155.4797
## duplicate1 28.7889527 681.1345
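# Note: resetting the seed to 200 reproduces the same rnorm() draws used for
# duplicate1, so duplicate2 is an exact copy of duplicate1
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1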
# Fit third model
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# Variable importance
importance(model3)
## %IncMSE IncNodePurity
## V1 26.5966363 653.3427
## V2 49.8265718 858.7374
## V3 9.9380201 223.2480
## V4 57.1340384 959.0849
## V5 26.4915570 418.4358
## V6 5.1642551 148.4304
## V7 -0.4659047 141.9013
## V8 -1.9287931 108.4986
## V9 -0.7660687 119.2874
## V10 -0.6738509 131.6092
## duplicate1 18.6728468 488.5668
## duplicate2 19.4147853 477.1718
After adding duplicate1, which is highly correlated with V1 (correlation = 0.95), the %IncMSE of V1 drops from 54.38 to 30.53, and duplicate1 receives a comparable value (28.79). When duplicate2 is also added, V1 falls further to 26.60, and the two duplicates score 18.67 and 19.41. This shows that the importance of V1 is shared across the correlated variables instead of being concentrated in a single predictor: the model distributes it among V1, duplicate1, and duplicate2. Note that because the seed is reset to 200 before duplicate2 is generated, duplicate2 is an exact copy of duplicate1 rather than a second independent noisy copy of V1.
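One detail worth checking is how the two duplicates relate to each other. The seed is set to 200 immediately before each call to rnorm(200), so both calls produce the same noise vector. The base R sketch below illustrates this; the alternative seed 201 is a hypothetical choice for comparison and was not used in the original analysis.

```r
set.seed(200)
noise1 <- rnorm(200)

set.seed(200)                  # same seed again, as in the code above
noise2 <- rnorm(200)
identical(noise1, noise2)      # TRUE: the same draws are reproduced

set.seed(201)                  # hypothetical different seed
noise3 <- rnorm(200)
identical(noise1, noise3)      # FALSE: an independent set of draws
cor(noise1, noise3)            # sample correlation of the two noise vectors
```

Using a different seed (or not resetting it) before creating the second duplicate would give a predictor that is highly correlated with V1 but not identical to duplicate1.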
# caret variable importance
varImp(model2)
## Overall
## V1 30.5311888
## V2 46.3123499
## V3 9.9366932
## V4 50.1272320
## V5 25.2178417
## V6 2.2367260
## V7 1.1812814
## V8 -0.9196298
## V9 1.2666279
## V10 1.9447746
## duplicate1 28.7889527
# Traditional RF importance
importance(model2)
## %IncMSE IncNodePurity
## V1 30.5311888 741.9267
## V2 46.3123499 857.7773
## V3 9.9366932 259.1364
## V4 50.1272320 927.4219
## V5 25.2178417 448.9739
## V6 2.2367260 178.8957
## V7 1.1812814 178.2118
## V8 -0.9196298 132.8826
## V9 1.2666279 144.3073
## V10 1.9447746 155.4797
## duplicate1 28.7889527 681.1345
The results from varImp() (caret) and importance() (randomForest) agree. For this regression forest, the Overall column returned by varImp() is identical to the %IncMSE column returned by importance(); importance() additionally reports IncNodePurity. For model2, V4 (50.13) and V2 (46.31) have the highest importance, followed by V1 (30.53), duplicate1 (28.79), and V5 (25.22), with V3 (9.94) in between. V6–V10 all score below 2.3, and V8 is slightly negative (-0.92). Both functions therefore give the same ranking and lead to the same conclusions about which variables matter most.
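The agreement between the two functions can be checked programmatically rather than by eye. The sketch below is a self-contained illustration on the built-in mtcars data, not the simulated data used above; the object names and the seed are assumptions made for the example. It relies on varImp() reporting the %IncMSE measure for a regression forest, which is what the outputs above suggest.

```r
library(randomForest)
library(caret)

set.seed(1)   # illustrative seed
fit <- randomForest(mpg ~ ., data = mtcars,
                    importance = TRUE,
                    ntree = 200)

rf_imp    <- importance(fit)[, "%IncMSE"]   # permutation importance from randomForest
caret_imp <- varImp(fit)$Overall            # caret's summary of the same fitted object

# Compare the two vectors after dropping names, which may differ in form
agreement <- all.equal(unname(rf_imp), unname(caret_imp))
agreement
```

If `agreement` is TRUE, the two functions are reporting the same quantity and any difference is only in presentation.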
library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(200)
boost_model <- gbm(y ~ ., data = simulated,
                   distribution = "gaussian",
                   n.trees = 1000,
                   interaction.depth = 3,
                   shrinkage = 0.01,
                   verbose = FALSE)
# Variable importance
summary(boost_model)
## var rel.inf
## V4 V4 28.2617202
## V2 V2 21.6848789
## V1 V1 20.0729511
## V5 V5 11.7083325
## V3 V3 7.4289018
## duplicate1 duplicate1 7.3488940
## V7 V7 1.0496405
## V6 V6 0.9727340
## V9 V9 0.5562115
## V8 V8 0.5006920
## V10 V10 0.4150434
## duplicate2 duplicate2 0.0000000
library(Cubist)
set.seed(200)
cubist_model <- cubist(x = simulated[, -which(names(simulated) == "y")],
                       y = simulated$y)
# Variable importance
summary(cubist_model)
##
## Call:
## cubist.default(x = simulated[, -which(names(simulated) == "y")], y
## = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition]
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (13 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.022980
## Relative |error| 0.50
## Correlation coefficient 0.87
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
The boosting and Cubist models show a similar pattern of variable importance. Boosting identifies V4 (28.26), V2 (21.68), and V1 (20.07) as the most important variables, which matches the random forest results. Cubist fits a single linear rule that uses V1, V2, V4, and V5 in 100% of cases and no other predictors, confirming that the same predictors are most important across models. In boosting, the correlated variable duplicate1 still receives some influence (7.35, compared with 20.07 for V1), so importance is again shared among correlated predictors, though less evenly than in the random forest; duplicate2 receives zero relative influence.
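As a rough way to quantify how closely the random forest and boosting rankings agree, the sketch below copies the %IncMSE values from importance(model3) and the rel.inf values from summary(boost_model) shown above and compares them in base R. The two measures are on different scales, so only the ordering is compared; the choice of a Spearman rank correlation is an assumption made for this illustration.

```r
# %IncMSE from importance(model3), copied from the output above
rf_inc_mse <- c(V1 = 26.5966363, V2 = 49.8265718, V3 = 9.9380201,
                V4 = 57.1340384, V5 = 26.4915570, V6 = 5.1642551,
                V7 = -0.4659047, V8 = -1.9287931, V9 = -0.7660687,
                V10 = -0.6738509, duplicate1 = 18.6728468,
                duplicate2 = 19.4147853)

# rel.inf from summary(boost_model), copied from the output above
gbm_rel_inf <- c(V4 = 28.2617202, V2 = 21.6848789, V1 = 20.0729511,
                 V5 = 11.7083325, V3 = 7.4289018, duplicate1 = 7.3488940,
                 V7 = 1.0496405, V6 = 0.9727340, V9 = 0.5562115,
                 V8 = 0.5006920, V10 = 0.4150434, duplicate2 = 0)

# Align both vectors on the same variable order before comparing ranks
vars <- names(rf_inc_mse)
rank_agreement <- cor(rf_inc_mse[vars], gbm_rel_inf[vars], method = "spearman")
rank_agreement

# Top four predictors under each measure
top4_rf  <- head(names(sort(rf_inc_mse, decreasing = TRUE)), 4)
top4_gbm <- head(names(sort(gbm_rel_inf, decreasing = TRUE)), 4)
top4_rf
top4_gbm
```

The largest disagreement in rank comes from duplicate2, which the random forest treats much like duplicate1 while boosting assigns it no influence at all.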