Exercise 5

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.5.2

set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.5.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Warning: package 'caret' was built under R version 4.5.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.5.3

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

set.seed(200)

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
# Variable importance
importance(model1)

##         %IncMSE IncNodePurity
## V1  54.37622618     1102.1570
## V2  47.44623992      917.1714
## V3  10.94311207      296.8864
## V4  54.36513898     1049.3256
## V5  23.22266289      501.4776
## V6   2.54373864      185.6101
## V7   0.99827043      192.6734
## V8  -1.35876789      142.2531
## V9   0.06585239      149.0796
## V10 -0.99865637      181.5005

varImp(model1)

##         Overall
## V1  54.37622618
## V2  47.44623992
## V3  10.94311207
## V4  54.36513898
## V5  23.22266289
## V6   2.54373864
## V7   0.99827043
## V8  -1.35876789
## V9   0.06585239
## V10 -0.99865637

The random forest results show that V1, V4, and V2 have the highest %IncMSE values (54.38, 54.37, and 47.45), followed by V5 (23.22) and V3 (10.94). In contrast, V6–V10 have importance values near zero or negative: V6 (2.54), V7 (1.00), V9 (0.07), V10 (-1.00), and V8 (-1.36). Permuting these variables barely changes the out-of-bag error, so they do not contribute meaningfully to the model. The random forest therefore separates the informative predictors (V1–V5) from the noise variables (V6–V10).
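The split between V1–V5 and V6–V10 follows from how the benchmark is constructed. As a rough sketch, assuming the standard Friedman 1 definition documented for mlbench.friedman1() (the function name friedman1_sketch is hypothetical, and this does not reproduce the exact random draws of the mlbench call above), the response depends only on the first five columns:

```r
# Hypothetical base-R re-implementation of the Friedman 1 benchmark.
# Assumption: y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + noise,
# with 10 independent Uniform(0, 1) predictors; columns 6-10 never enter y.
friedman1_sketch <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("V", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  data.frame(x, y = signal + rnorm(n, sd = sd))
}

set.seed(1)
sim <- friedman1_sketch(2000)

# Marginal correlation of each predictor with y.
# This linear measure misses the symmetric quadratic effect of V3.
round(cor(sim[, paste0("V", 1:10)], sim$y)[, 1], 2)
```

The correlations are only a quick sanity check; the permutation importance from the random forest is the measure used in this exercise.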

set.seed(200) 

simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

# Check correlation
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9497025

# Fit new model
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

# Variable importance
importance(model2)

##               %IncMSE IncNodePurity
## V1         30.5311888      741.9267
## V2         46.3123499      857.7773
## V3          9.9366932      259.1364
## V4         50.1272320      927.4219
## V5         25.2178417      448.9739
## V6          2.2367260      178.8957
## V7          1.1812814      178.2118
## V8         -0.9196298      132.8826
## V9          1.2666279      144.3073
## V10         1.9447746      155.4797
## duplicate1 28.7889527      681.1345

set.seed(200) 

simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

# Fit third model
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

# Variable importance
importance(model3)

##               %IncMSE IncNodePurity
## V1         26.5966363      653.3427
## V2         49.8265718      858.7374
## V3          9.9380201      223.2480
## V4         57.1340384      959.0849
## V5         26.4915570      418.4358
## V6          5.1642551      148.4304
## V7         -0.4659047      141.9013
## V8         -1.9287931      108.4986
## V9         -0.7660687      119.2874
## V10        -0.6738509      131.6092
## duplicate1 18.6728468      488.5668
## duplicate2 19.4147853      477.1718

After adding duplicate1, which is highly correlated with V1 (correlation = 0.95), the importance of V1 drops from 54.38 to 30.53, while duplicate1 receives a comparable score (28.79). When duplicate2 is also added, V1 falls further to 26.60 and the two duplicates score 18.67 and 19.41. Note that set.seed(200) is reset before each rnorm(200) call, so duplicate2 is an exact copy of duplicate1 rather than a second independent noisy copy of V1. In both cases, the importance of V1 is shared across the correlated variables instead of being concentrated in a single predictor, while the scores of V2–V5 change comparatively little.
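The effect of calling set.seed(200) immediately before each rnorm(200) can be checked directly in base R. A minimal sketch (the variable names are hypothetical):

```r
# Resetting the seed replays the same random stream
set.seed(200)
noise1 <- rnorm(200)
set.seed(200)
noise2 <- rnorm(200)
identical(noise1, noise2)    # TRUE

# Seeding once and drawing twice gives two different noise vectors
set.seed(200)
noise_a <- rnorm(200)
noise_b <- rnorm(200)
identical(noise_a, noise_b)  # FALSE
```

If two distinct noisy copies of V1 are wanted, one option is to seed once and draw the noise for both duplicates from the same stream; the importance values would then differ from the output shown here.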

# caret variable importance
varImp(model2)

##               Overall
## V1         30.5311888
## V2         46.3123499
## V3          9.9366932
## V4         50.1272320
## V5         25.2178417
## V6          2.2367260
## V7          1.1812814
## V8         -0.9196298
## V9          1.2666279
## V10         1.9447746
## duplicate1 28.7889527

# Traditional RF importance
importance(model2)

##               %IncMSE IncNodePurity
## V1         30.5311888      741.9267
## V2         46.3123499      857.7773
## V3          9.9366932      259.1364
## V4         50.1272320      927.4219
## V5         25.2178417      448.9739
## V6          2.2367260      178.8957
## V7          1.1812814      178.2118
## V8         -0.9196298      132.8826
## V9          1.2666279      144.3073
## V10         1.9447746      155.4797
## duplicate1 28.7889527      681.1345

The results from varImp() (caret) and importance() (randomForest) agree. For this model, the Overall column returned by varImp() is identical to the %IncMSE column returned by importance(). Both rank V4 (50.13) and V2 (46.31) highest, followed by V1 (30.53), duplicate1 (28.79), and V5 (25.22), and both assign low importance to V6–V10, with V7 (1.18) near zero and V8 (-0.92) negative. The IncNodePurity column is on a different scale but gives nearly the same ordering, so the measures agree on which variables matter most.
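One way to quantify how closely two importance measures agree is a rank correlation. The sketch below uses only base R and the numbers printed for model2 above (copied by hand, so nothing here refits a model); the vector names are hypothetical:

```r
# Values copied from the importance(model2) output above
vars <- c(paste0("V", 1:10), "duplicate1")
inc_mse <- c(30.5311888, 46.3123499, 9.9366932, 50.1272320, 25.2178417,
             2.2367260, 1.1812814, -0.9196298, 1.2666279, 1.9447746,
             28.7889527)
node_purity <- c(741.9267, 857.7773, 259.1364, 927.4219, 448.9739,
                 178.8957, 178.2118, 132.8826, 144.3073, 155.4797,
                 681.1345)
names(inc_mse) <- names(node_purity) <- vars

# Variables ordered from most to least important under each measure
names(sort(inc_mse, decreasing = TRUE))
names(sort(node_purity, decreasing = TRUE))

# Spearman rank correlation between the two measures
rho <- cor(inc_mse, node_purity, method = "spearman")
rho
```

The two orderings differ only among the low-importance variables, which is where the scores are dominated by noise.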

library(gbm)

## Warning: package 'gbm' was built under R version 4.5.3

## Loaded gbm 2.2.3

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

set.seed(200)

boost_model <- gbm(y ~ ., data = simulated,
                   distribution = "gaussian",
                   n.trees = 1000,
                   interaction.depth = 3,
                   shrinkage = 0.01,
                   verbose = FALSE)

# Variable importance
summary(boost_model)

##                   var    rel.inf
## V4                 V4 28.2617202
## V2                 V2 21.6848789
## V1                 V1 20.0729511
## V5                 V5 11.7083325
## V3                 V3  7.4289018
## duplicate1 duplicate1  7.3488940
## V7                 V7  1.0496405
## V6                 V6  0.9727340
## V9                 V9  0.5562115
## V8                 V8  0.5006920
## V10               V10  0.4150434
## duplicate2 duplicate2  0.0000000

library(Cubist)

## Warning: package 'Cubist' was built under R version 4.5.3

set.seed(200)

cubist_model <- cubist(x = simulated[, -which(names(simulated) == "y")],
                       y = simulated$y)

# Variable importance
summary(cubist_model)

## 
## Call:
## cubist.default(x = simulated[, -which(names(simulated) == "y")], y
##  = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (13 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.022980
##     Relative |error|               0.50
##     Correlation coefficient        0.87
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
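
The single rule printed above is a plain linear equation, so it can be evaluated by hand. A small sketch, assuming the rounded coefficients shown in the summary (the function name cubist_rule1 and the input values are hypothetical; predict() on the fitted model may differ slightly because the printed coefficients are rounded):

```r
# Coefficients copied from Rule 1 in the Cubist summary (rounded as printed)
cubist_rule1 <- function(V1, V2, V4, V5) {
  0.183529 + 8.9 * V4 + 7.9 * V1 + 7.1 * V2 + 5.3 * V5
}

# Hypothetical input: all four predictors at the midpoint of [0, 1]
pred_mid <- cubist_rule1(V1 = 0.5, V2 = 0.5, V4 = 0.5, V5 = 0.5)
pred_mid
```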

The boosting and Cubist models show a similar pattern of variable importance. Boosting identifies V4 (28.26), V2 (21.68), and V1 (20.07) as the most influential variables, which matches the random forest results. Cubist fits a single linear rule that uses V1, V2, V4, and V5 in 100% of cases and leaves out V3, the noise variables, and both duplicates. In boosting, duplicate1 has a relative influence of 7.35, so part of V1's signal is again attributed to a correlated predictor, although V1 keeps a much larger share than it did in the random forest. duplicate2 has a relative influence of exactly 0, which is consistent with it being an identical copy of duplicate1 (both are generated from the same seed).
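Because the relative influence values from gbm sum to 100, the share attributed to V1 and its copies can be added up directly. A minimal sketch using the values printed by summary(boost_model) above (copied by hand; the group label "V1 group" is hypothetical):

```r
# Relative influence copied from the summary(boost_model) output above
rel_inf <- c(V4 = 28.2617202, V2 = 21.6848789, V1 = 20.0729511,
             V5 = 11.7083325, V3 = 7.4289018, duplicate1 = 7.3488940,
             V7 = 1.0496405, V6 = 0.9727340, V9 = 0.5562115,
             V8 = 0.5006920, V10 = 0.4150434, duplicate2 = 0.0000000)

# Relative influence is normalized, so the total is 100 (up to rounding)
sum(rel_inf)

# Pool V1 with its noisy copies and re-rank
group <- ifelse(names(rel_inf) %in% c("V1", "duplicate1", "duplicate2"),
                "V1 group", names(rel_inf))
grouped <- sapply(split(rel_inf, group), sum)
grouped[order(grouped, decreasing = TRUE)]
```

Pooling the group gives a rough idea of how much influence V1 might have had without the duplicates, although refitting the model without them is the only way to know for sure.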

2026-04-05