Part 0. Setup

library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(67)

simulated <- mlbench.friedman1(200, sd = 1)

simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)

colnames(simulated)[ncol(simulated)] <- "y"

featurePlot(
  x = simulated[, 1:10],   # predictor columns
  y = simulated$y,          # response
  plot = "scatter",
  layout = c(5, 2)          # 5 cols, 2 rows
)
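For reference, mlbench.friedman1 draws ten independent uniform predictors on [0, 1] and builds the response from the first five only, so V6 through V10 are pure noise. The sketch below rebuilds that formula by hand in base R. It is an illustration only: the names x, signal, and y_manual are ours, and it will not reproduce the exact draws made by mlbench.

```r
# Hand-rolled version of the Friedman 1 benchmark (illustrative sketch only)
set.seed(67)
n <- 200
x <- matrix(runif(n * 10), ncol = 10)          # ten U(0, 1) predictors

# Only columns 1 to 5 enter the signal; columns 6 to 10 are never used
signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
  20 * (x[, 3] - 0.5)^2 +
  10 * x[, 4] +
  5 * x[, 5]

y_manual <- signal + rnorm(n, sd = 1)          # same noise level as sd = 1 above

# Marginal correlation of each predictor with the response
cor_with_y <- cor(x, y_manual)[, 1]
round(cor_with_y, 2)
```

The correlations for columns 6 to 10 should sit near zero, which is the baseline any importance measure is judged against in the parts that follow.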

Part A. Base Random Forest

From varImpPlot we can see that no, the random forest does not significantly use the uninformative predictors (V6 through V10). It only barely uses V6, at ~5% on the %IncMSE scale, and V9, at ~4%.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(67)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)

varImpPlot(model1, main = "Base Rand Tree", type = 1)
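The type = 1 plot reports a permutation-based measure: how much the prediction error grows when one predictor's values are shuffled. The sketch below shows that idea in base R on a toy data set. It is a simplification under stated assumptions: it uses lm instead of a random forest, it scores on the training data rather than on out-of-bag samples, and the names toy, signal, noise, and perm_imp are hypothetical.

```r
# Permutation importance in miniature (lm stand-in, not the randomForest algorithm)
set.seed(67)
n <- 200
toy <- data.frame(signal = runif(n), noise = runif(n))
toy$y <- 10 * toy$signal + rnorm(n)            # only `signal` drives y

fit <- lm(y ~ signal + noise, data = toy)

# Mean squared error of the fitted model on a given data frame
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
base_mse <- mse(toy)

# Shuffle one column at a time and record the increase in MSE
perm_imp <- sapply(c("signal", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - base_mse
})

round(perm_imp, 2)
```

Shuffling the informative column breaks its link to y and the error jumps, while shuffling the noise column changes almost nothing. That is the pattern the plot above shows for V1 to V5 versus V6 to V10.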

Part B. Random Forest with V1 Duplicate

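The importance score for V1 did in fact change when we added a highly correlated duplicate predictor: V1's importance went down. This is most likely because the forest can now split on either V1 or dupe1 to capture the same signal, so the two share the importance.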

simulated$dupe1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$dupe1, simulated$V1)
## [1] 0.9459032
set.seed(67)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)

varImpPlot(model2, main = "Rand Tree w/ V1 Dupe", type = 1)
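The correlation of about 0.95 follows directly from how the duplicate is built. V1 is uniform on [0, 1], so its variance is 1/12, and the added noise has standard deviation 0.1. The sketch below works out the implied correlation and checks it by simulation; the variable names are ours and the simulation is separate from the report's data.

```r
# Expected correlation between a U(0, 1) variable and a noisy copy of it
noise_sd <- 0.1
var_x <- 1 / 12                                 # variance of U(0, 1)

# cor(x, x + e) = var(x) / sqrt(var(x) * (var(x) + var(e)))
expected_cor <- sqrt(var_x / (var_x + noise_sd^2))

# Simulation check with the same sample size as the report (n = 200)
set.seed(67)
sims <- replicate(2000, {
  x <- runif(200)
  cor(x, x + rnorm(200) * noise_sd)
})

round(c(expected = expected_cor, simulated_mean = mean(sims)), 3)
```

The analytic value is roughly 0.945, which matches the correlations reported for dupe1 here and for dupe2 in the next part up to sampling variation.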

Part B.2. Random Forest with Two V1 Duplicates

The change in importance wasn’t as great going from one duplicate to two as it was going from none to one, but V1 still went down in importance slightly. Adding the second duplicate also slightly increased the importance of every other informative predictor.

simulated$dupe2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$dupe2, simulated$V1)
## [1] 0.9373142
set.seed(67)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)

varImpPlot(model3, main = "Rand Tree w/ Two V1 Dupes", type = 1)
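The three varImp results are stored but never shown side by side. A small helper can line them up, assuming each table has an Overall column with predictor names as row names, which is the shape caret's varImp returns for a regression forest. The function compare_importance and the toy tables below are hypothetical placeholders, not model output.

```r
# Line up several importance tables by predictor name (illustrative helper)
compare_importance <- function(imp_list) {
  # Union of predictor names across all tables, in order of first appearance
  vars <- unique(unlist(lapply(imp_list, rownames)))
  # Predictors missing from a table come back as NA
  out <- sapply(imp_list, function(imp) imp[vars, "Overall"])
  rownames(out) <- vars
  out
}

# Placeholder tables with made-up values, used only to show the output shape
toy_a <- data.frame(Overall = c(3, 2, 1),
                    row.names = c("V1", "V2", "V3"))
toy_b <- data.frame(Overall = c(2, 2, 1, 1.5),
                    row.names = c("V1", "V2", "V3", "dupe1"))

tab <- compare_importance(list(base = toy_a, one_dupe = toy_b))
tab
```

With the objects from this report the call would be compare_importance(list(base = rfImp1, one_dupe = rfImp2, two_dupes = rfImp3)), which puts the V1 scores from all three forests in one row.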

Part C. Conditional Inference Trees

From the charts we can see that the unconditional measure still makes V1 less important overall, and it even gives more importance to dupe2. V1, dupe1, and dupe2 are all competing for importance here, which lets V2 overtake all three as the second most important predictor.

The conditional measure, meanwhile, massively decreases the importance of V1 and of the duplicate predictors.

library(party)
## Warning: package 'party' was built under R version 4.5.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.5.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.5.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.5.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.5.3
set.seed(67)
model4 <- cforest(y ~ ., data = simulated,
                  controls = cforest_unbiased(ntree = 1000, mtry = 5))

rfImp4_uncond <- varimp(model4, conditional = FALSE)
rfImp4_cond <- varimp(model4, conditional = TRUE)

dotchart(sort(rfImp4_uncond),
         labels = names(sort(rfImp4_uncond)),
         main   = "Unconditional Imp")

dotchart(sort(rfImp4_cond),
         labels = names(sort(rfImp4_cond)),
         main   = "Conditional Imp")
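The drop under conditional = TRUE makes sense once we look at what the permutation does. The conditional scheme shuffles a predictor only among observations that look alike on its correlated partners, so a predictor that is nearly a copy of another barely changes. The base R sketch below illustrates this with decile bins of a noisy duplicate. The decile grid is our assumption for illustration; party builds its own grid from the associated covariates, and all names here are hypothetical.

```r
# Unconditional vs. conditional shuffling of a predictor with a noisy duplicate
set.seed(67)
n <- 200
x <- runif(n)
dupe <- x + rnorm(n) * 0.1                      # same construction as dupe1

# Unconditional: shuffle x across all rows
x_uncond <- sample(x)

# Conditional (sketch): shuffle x only within decile bins of the duplicate
bins <- cut(dupe,
            breaks = quantile(dupe, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)
x_cond <- ave(x, bins, FUN = function(v) v[sample.int(length(v))])

# How much of the original x survives each kind of shuffle
cor_uncond <- cor(x, x_uncond)
cor_cond <- cor(x, x_cond)
round(c(unconditional = cor_uncond, conditional = cor_cond), 2)
```

After the unconditional shuffle, x is essentially unrelated to its original values. After the conditional shuffle it stays close to them, so predictions barely degrade and the measured importance of V1 and its duplicates shrinks.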

Part D. Boosted and Cubist Trees

The boosted tree seems to follow the same pattern, only more strongly. The duplicates don’t just cause a small decrease in importance for V1; they cause a huge one. This places every other informative predictor, as well as dupe2, above V1.

The Cubist model, on the other hand, largely avoids the problem. It gives very little importance to the duplicate predictors and strong importance to V1, almost back to where it was before the duplicates were added.

library(gbm)
## Warning: package 'gbm' was built under R version 4.5.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(67)
model5 <- gbm(y ~ ., data = simulated,
              distribution = "gaussian",
              n.trees = 1000)

boostImp <- summary(model5, plotit = FALSE)

dotchart(boostImp$rel.inf,
         labels = boostImp$var,
         main = "Boosted Tree Importance")

library(Cubist)
## Warning: package 'Cubist' was built under R version 4.5.3
x_vars <- simulated[, names(simulated) != "y"]
y_var  <- simulated$y

model6 <- cubist(x = x_vars, y = y_var, committees = 20)

cubistImp <- varImp(model6)

dotchart(sort(cubistImp$Overall),
         labels = rownames(cubistImp)[order(cubistImp$Overall)],
         main = "Cubist Tree Importance")