Exercise 8.1

Setup

library(mlbench)
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(patchwork)
library(partykit)
library(Cubist)
library(gbm)

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
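For context, the Friedman 1 benchmark that mlbench.friedman1 simulates builds the response from only the first five of the ten uniform predictors, which is why V6–V10 are treated as uninformative in the questions that follow. The sketch below writes that formula out in base R. The helper name friedman1_signal is hypothetical, and the draws are not claimed to reproduce the exact values in simulated.

```r
# Hypothetical helper: the noise-free part of the Friedman 1 response.
# Only columns 1-5 of x appear in the formula.
friedman1_signal <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
}

set.seed(200)
x <- matrix(runif(200 * 10), ncol = 10)        # ten Uniform(0, 1) predictors
y <- friedman1_signal(x) + rnorm(200, sd = 1)  # add Gaussian noise, sd = 1

# Replacing columns 6-10 with fresh random values leaves the signal unchanged
x_changed <- x
x_changed[, 6:10] <- runif(200 * 5)
identical(friedman1_signal(x), friedman1_signal(x_changed))
```

The final line returns TRUE, since no term of the formula touches columns 6 to 10.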

Section A

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ . , data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1

##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

p1 <- rfImp1 %>% 
  mutate(var = rownames(rfImp1)) %>% 
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Variable Importance", y = "Variable")

p1

Did the random forest model significantly use the uninformative predictors (V6–V10)?

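Response: Based on the output above, no. The informative predictors V1, V2, V4 and V5 score between 2.0 and 8.7 and V3 scores 0.76, while V6–V10 all score within about 0.17 of zero (V6 at 0.165, the other four slightly negative). The random forest made essentially no use of the uninformative predictors.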
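The scores above are permutation-based: a predictor matters to the extent that shuffling its values makes the predictions worse. A minimal base-R sketch of that idea follows. It uses a linear model on toy data and measures the error on the training rows, whereas randomForest uses out-of-bag samples, so it illustrates the principle rather than the package's implementation. The helper perm_importance and the toy data frame are hypothetical.

```r
# Hypothetical helper: average increase in MSE after shuffling each predictor.
perm_importance <- function(fit, data, response, n_rep = 20) {
  predictors <- setdiff(names(data), response)
  base_mse <- mean((data[[response]] - predict(fit, data))^2)
  sapply(predictors, function(p) {
    mean(replicate(n_rep, {
      shuffled <- data
      shuffled[[p]] <- sample(shuffled[[p]])  # break the link between p and y
      mean((data[[response]] - predict(fit, shuffled))^2) - base_mse
    }))
  })
}

set.seed(1)
toy <- data.frame(signal = rnorm(200), noise = rnorm(200))
toy$y <- 3 * toy$signal + rnorm(200, sd = 0.5)  # y depends on signal only

fit <- lm(y ~ signal + noise, data = toy)
imp <- perm_importance(fit, toy, "y")
round(imp, 3)
```

The predictor that drives y gets a large score and the pure-noise predictor a score near zero, which mirrors the gap between V1–V5 and V6–V10 above.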

Section B

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9460206
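As a sanity check on that correlation, the expected value can be worked out from the variances, assuming V1 is Uniform(0, 1) (as the Friedman 1 inputs are) and the added noise is independent of it with standard deviation 0.1. The variable names below are illustrative.

```r
var_v1    <- 1 / 12   # variance of a Uniform(0, 1) variable
var_noise <- 0.1^2    # variance of rnorm(200) * .1

# cor(V1, V1 + e) = var(V1) / (sd(V1) * sd(V1 + e)) when e is independent of V1
expected_cor <- sqrt(var_v1 / (var_v1 + var_noise))
expected_cor
```

This gives roughly 0.945, in line with the sample value of 0.946 printed above.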

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ . , data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2

##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)

## [1] 0.9408631

model3 <- randomForest(y ~ . , data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3

##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

p2 <- rfImp2 %>% 
  mutate(var = rownames(rfImp2)) %>% 
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "VI Duplicate 1", y = "Variable")

p3 <- rfImp3 %>% 
  mutate(var = rownames(rfImp3)) %>% 
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "VI Duplicate 2", y = "Variable")

(p1 | p2 | p3)

Response: Yes. Before any duplicates were added, V1 was the most important predictor with a score of 8.73. With duplicate1 in the model (correlation 0.946 with V1), V1 drops to 5.69 and falls to third place behind V4 and V2, while duplicate1 picks up 4.28. Adding duplicate2 (correlation 0.941) lowers V1 further to 4.92, with duplicate1 at 3.80 and duplicate2 at 1.88. The random forest spreads the importance of the underlying signal across the correlated predictors, because any of them can stand in for V1 at a split. This is why feature selection matters when highly correlated predictors like these are present.
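One way to see the dilution is to add up the scores of V1 and its duplicates in each model, using the values printed in the three importance tables above. The object names are illustrative, and the model labels refer to model1, model2 and model3.

```r
# Importance scores copied from the rfImp1, rfImp2 and rfImp3 output above
v1_group <- list(
  model1 = c(V1 = 8.732235404),
  model2 = c(V1 = 5.69119973, duplicate1 = 4.28331581),
  model3 = c(V1 = 4.91687329, duplicate1 = 3.80068234, duplicate2 = 1.87721959)
)

group_total <- sapply(v1_group, sum)                        # V1 plus its duplicates
v1_share <- sapply(v1_group, function(s) s[["V1"]] / sum(s))  # V1's part of that total

round(rbind(group_total, v1_share), 3)
```

The combined score of the group stays in the range of roughly 8.7 to 10.6 across the three models, while V1's own share of it falls with each added duplicate.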

Section C

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y ~ . , data = simulated)
cfImp4False <- varimp(model4, conditional = FALSE) %>% as.data.frame()
cfImp4True <- varimp(model4, conditional = TRUE) %>% as.data.frame()

cfImp4False

##                      .
## V1          5.42152778
## V2          5.79075433
## V3         -0.01425384
## V4          5.96294311
## V5          1.87652947
## V6         -0.06822257
## V7          0.07696942
## V8         -0.17067028
## V9          0.02155362
## V10        -0.13376134
## duplicate1  5.27935126
## duplicate2  2.92224516

cfImp4True

##                       .
## V1          2.442631466
## V2          5.022117020
## V3         -0.007635221
## V4          5.209258304
## V5          1.321172122
## V6         -0.079239512
## V7          0.012405765
## V8         -0.372933317
## V9         -0.106808893
## V10        -0.206104896
## duplicate1  2.130422253
## duplicate2  0.592237637

p4 <- cfImp4False %>% 
  rename(Overall = '.') %>% 
  mutate(var = rownames(cfImp4False)) %>% 
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Condition = False", y = "Variable")

p5 <- cfImp4True %>% 
  rename(Overall = '.') %>% 
  mutate(var = rownames(cfImp4True)) %>% 
  ggplot(aes(Overall, reorder(var, Overall, sum))) +
  geom_col(fill = "seagreen") +
  labs(title = "Condition = True", y = "Variable")

p4 | p5

Response: With conditional = FALSE, V1 scores 5.42 and ranks third behind V4 (5.96) and V2 (5.79), with duplicate1 close behind at 5.28 and duplicate2 at 2.92. This follows the pattern of the traditional RF, where V1's importance is shared with its duplicates. With conditional = TRUE, V1 drops to 2.44, duplicate1 to 2.13 and duplicate2 to 0.59, while V2 and V4 stay above 5. V1 still ranks third, but all three correlated predictors lose importance, which indicates that the conditional measure adjusts for the correlation among them and therefore does not follow the pattern of the traditional RF.
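The difference between the two settings comes down to how a predictor is shuffled. A marginal permutation shuffles it across all rows, which also destroys its relationship with correlated predictors. A conditional permutation shuffles it only among rows where the correlated predictor has similar values. The sketch below illustrates this with quantile bins in base R. It is a simplification, not the partitioning that varimp uses, and all names are illustrative.

```r
set.seed(42)
n   <- 200
v1  <- runif(n)
dup <- v1 + rnorm(n) * 0.1   # built the same way as duplicate1

# Marginal permutation: shuffle v1 across all rows
v1_marginal <- sample(v1)

# Conditional permutation (simplified): shuffle v1 only within deciles of dup
bins <- cut(dup, breaks = quantile(dup, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)
v1_conditional <- ave(v1, bins, FUN = function(v) v[sample.int(length(v))])

c(marginal    = cor(v1_marginal, dup),
  conditional = cor(v1_conditional, dup))
```

The marginally shuffled copy is nearly uncorrelated with dup, while the conditionally shuffled copy keeps most of the correlation. Under the conditional scheme, shuffling V1 changes the predictions much less when a near-copy of it is still in place, which is consistent with the lower scores for V1 and its duplicates when conditional = TRUE.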

Section D

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

gbmModel <- train(y ~ ., data = simulated, method = "gbm", distribution = "gaussian", verbose = FALSE)
gbmImp <- varImp(gbmModel)
gbmImp

## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          78.4226
## duplicate1  50.8513
## V1          43.5310
## V5          36.4829
## V3          32.2035
## duplicate2  11.8373
## V6           2.6307
## V7           2.1681
## V9           1.1758
## V8           0.5614
## V10          0.0000

Cubist

cubistModel <- train(y ~ ., data = simulated, method = "cubist")
cubImp <- varImp(cubistModel)
cubImp

## cubist variable importance
## 
##            Overall
## V2          100.00
## V1           77.70
## V4           71.94
## V5           54.68
## V3           46.04
## duplicate2   35.97
## duplicate1   35.97
## V6           14.39
## V9            0.00
## V8            0.00
## V7            0.00
## V10           0.00

# Named p6 and p7 so the cforest plots (p4, p5) above are not overwritten
p6 <- gbmImp$importance %>% 
  mutate(var = rownames(gbmImp$importance)) %>% 
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = "seagreen") +
  labs(title = "Boosted Tree Model: VI", y = "Variable")

p7 <- cubImp$importance %>% 
  mutate(var = rownames(cubImp$importance)) %>% 
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = "seagreen") +
  labs(title = "Cubist Model: VI", y = "Variable")

p6 | p7

Response: In the boosted tree model (scores scaled from 0 to 100), the V1 signal is still split across the correlated predictors, but unevenly: duplicate1 (50.85) ranks slightly above V1 (43.53), while duplicate2 (11.84) receives much less. A plausible explanation is that boosting fits trees sequentially, so once one of the correlated predictors has been used, the others add little new information. The Cubist model shows a mixed pattern relative to the traditional RF: V1 remains the second most important predictor (77.70), and the two duplicates tie at 35.97, below all five informative predictors.

Exercise_5

Sofia Hernandez

2026-03-30