library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Random Forest Model + Variable Importance
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Question:
Did the random forest model significantly use the uninformative
predictors (V6–V10)?
No. V1, V2, V4, and V5 have importance scores between about 2.0 and 8.7, while V6–V10 all fall between about -0.17 and 0.17. These are unscaled permutation importances, so a value near zero means that permuting the predictor barely changes the out-of-bag error. A small negative value means the error happened to drop slightly after permutation, which is chance variation rather than evidence that the variable hurts the model. The forest therefore concentrated on the informative predictors and made essentially no use of V6–V10. V3 is informative in the Friedman simulation but scores only about 0.76, so its contribution is weak in this fit.
(b) Add Correlated Predictor
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
# Fit new model
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.00709780
## V2 6.05937899
## V3 0.58465293
## V4 6.86363294
## V5 2.19939891
## V6 0.10898039
## V7 0.06104207
## V8 -0.04059204
## V9 0.06123662
## V10 0.09999339
## duplicate1 4.43323167
Questions:
Did the importance score for V1 change?
Yes. The importance score for V1 dropped noticeably after adding the highly correlated predictor, from about 8.73 to about 6.01, a reduction of roughly 31%. At the same time duplicate1 received a score of about 4.43, which indicates that part of V1's importance has been redistributed to the new correlated variable.
What happens when adding a highly correlated predictor?
When a highly correlated predictor is added, the model spreads the importance between the correlated variables instead of assigning it all to one. Because both predictors carry nearly the same information, the random forest can split on either one in different trees, and permuting one of them does less damage because the other is still available. As a result each variable's individual score is lower than V1's original score, even though the predictive power of the model should stay about the same. Here V1 (6.01) and duplicate1 (4.43) together total about 10.44, compared with 8.73 for V1 alone.
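The dilution can be checked with a little arithmetic on the scores printed above. This is only a sketch using base R; the three numbers are copied from the two `varImp()` tables, and the variable names are introduced here for illustration.

```r
# Importance scores copied from the printed varImp() tables
v1_before <- 8.732235404   # V1 in model1 (no duplicate)
v1_after  <- 6.00709780    # V1 in model2
dup_after <- 4.43323167    # duplicate1 in model2

combined        <- v1_after + dup_after                  # importance held by the pair
relative_drop   <- (v1_before - v1_after) / v1_before    # fraction of V1's score lost
duplicate_share <- dup_after / combined                  # duplicate1's part of the pair

round(c(combined = combined,
        relative_drop = relative_drop,
        duplicate_share = duplicate_share), 3)
```

V1 loses about 31% of its original score, and duplicate1 holds about 42% of the importance the pair shares.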
(c) Conditional Inference Forest
library(party)
cf_model <- cforest(y ~ ., data = simulated)
cf_imp <- varimp(cf_model)
cf_imp
## V1 V2 V3 V4 V5 V6
## 6.813253411 6.230035644 0.030496476 7.342522924 1.896415492 -0.031302902
## V7 V8 V9 V10 duplicate1
## -0.014426737 -0.029347716 -0.003838825 -0.016841560 2.536782153
Question: Do these importances show the same pattern as the traditional random forest?
Yes, the importance values follow a similar pattern to the traditional random forest. V4, V1, and V2 still have the highest scores (about 7.34, 6.81, and 6.23), while V6 through V10 are all slightly negative and effectively zero. The correlated variable duplicate1 again takes on some importance (about 2.54), so the predictive signal is still being divided between related features, although V1 keeps a larger share than it did in the random forest. One difference is that V3 drops to about 0.03, which is indistinguishable from the noise predictors. Note that varimp() was called with its default unconditional permutation scheme, so these scores are not adjusted for the correlation between V1 and duplicate1.
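The party package also offers a conditional permutation scheme through `varimp(..., conditional = TRUE)`, which is intended to adjust for correlated predictors. The sketch below compares the two schemes on a freshly simulated data set. It is an illustration rather than a rerun of the analysis above: the data are regenerated in base R from the Friedman 1 formula, and `ntree = 50` and `mtry = 3` are assumptions chosen to keep the run short, so the scores will not match the table above.

```r
library(party)

set.seed(200)
n <- 200
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))   # columns V1..V10

# Friedman 1 signal: only V1..V5 affect y
x$y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)

# Near-copy of V1, built the same way as in part (b)
x$duplicate1 <- x$V1 + rnorm(n) * 0.1

# Small forest (assumed settings) so the conditional scheme finishes quickly
cf_fit <- cforest(y ~ ., data = x,
                  controls = cforest_unbiased(ntree = 50, mtry = 3))

imp_uncond <- varimp(cf_fit)                       # default scheme
imp_cond   <- varimp(cf_fit, conditional = TRUE)   # adjusts for correlated predictors

round(cbind(unconditional = imp_uncond, conditional = imp_cond), 3)
```

Comparing the V1 and duplicate1 rows across the two columns shows how much of the shared importance is attributable to their correlation.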
(d) Other Tree Models
library(gbm)
library(Cubist)
# Fit boosted trees
gbm_model <- gbm(y ~ ., data = simulated,
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3)
summary(gbm_model)
## var rel.inf
## V4 V4 27.600559
## V1 V1 20.055311
## V2 V2 19.404864
## V5 V5 12.848122
## V3 V3 7.479351
## duplicate1 duplicate1 4.963427
## V6 V6 2.035954
## V7 V7 1.562304
## V9 V9 1.457551
## V8 V8 1.394830
## V10 V10 1.197728
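# Fit Cubist
# Select predictors by name: duplicate1 is now the last column, so
# simulated[, -ncol(simulated)] would drop duplicate1 and pass y in as a predictor.
cubist_model <- cubist(x = simulated[, setdiff(names(simulated), "y")],
                       y = simulated$y)
cubist_model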
##
## Call:
## cubist.default(x = simulated[, -ncol(simulated)], y = simulated$y)
##
## Number of samples: 200
## Number of predictors: 11
##
## Number of committees: 1
## Number of rules: 1
Question: Does the same importance pattern occur?
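Broadly yes for the boosted trees. V4, V1, and V2 lead with relative influences of about 27.6, 20.1, and 19.4, followed by V5 (12.8) and V3 (7.5), while V6 through V10 each score between about 1.2 and 2.0. Relative influence is non-negative and sums to 100, so the noise predictors cannot reach zero or go negative as they did in the permutation-based tables. The main difference is the duplicate: duplicate1 receives only about 5.0 against 20.1 for V1, so boosting shares much less of V1's importance with it than the random forest did (4.43 against 6.01). The Cubist printout reports a single rule and no importance scores, so it does not add to the comparison.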
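The Cubist printout above shows only the rule count, so no importance pattern can be read from it. One way to get one is the variable usage table stored on the fitted object, sketched below. The data are regenerated in base R from the Friedman 1 formula rather than taken from the objects above, `committees = 10` is an assumed setting, and the `usage` component is relied on as described in the Cubist documentation. The predictors are selected by name so that `y` cannot end up among them.

```r
library(Cubist)

set.seed(200)
n <- 200
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))   # columns V1..V10
x$y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)
x$duplicate1 <- x$V1 + rnorm(n) * 0.1

# Select predictors by name so the outcome is never passed in as a predictor
predictors <- setdiff(names(x), "y")

cubist_fit <- cubist(x = x[, predictors], y = x$y, committees = 10)

# Percentage of rule conditions and linear models that use each predictor
usage <- cubist_fit$usage
usage
```

Predictors with high values in either usage column play the role that high importance scores play in the forest and boosting tables.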
Use a simulation to show tree bias with different granularities.
library(randomForest)
library(party)
set.seed(123)
# Simulate data
n <- 1000
y <- rnorm(n) # pure noise outcome
# Predictors with different granularities
x1 <- rnorm(n) # continuous (many split points)
x2 <- sample(1:20, n, replace=TRUE) # medium granularity
x3 <- sample(1:5, n, replace=TRUE) # low granularity
x4 <- sample(0:1, n, replace=TRUE) # binary
data <- data.frame(y, x1, x2, x3, x4)
# Standard Random Forest
rf_model <- randomForest(y ~ ., data=data, importance=TRUE)
importance(rf_model)
## %IncMSE IncNodePurity
## x1 1.082533 93.738123
## x2 -2.321591 54.245552
## x3 -4.252093 24.725638
## x4 -2.068891 8.619194
# Conditional Inference Forest (less biased)
cf_model <- cforest(y ~ ., data = data,
                    controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(cf_model)
## x1 x2 x3 x4
## -0.011511280 -0.005763513 -0.017214835 -0.012746774
These results illustrate tree bias caused by different levels of granularity. All four predictors are pure noise, yet IncNodePurity falls steadily with the number of distinct values: about 93.7 for the continuous x1, 54.2 for x2 (20 values), 24.7 for x3 (5 values), and 8.6 for the binary x4. This occurs because a variable with more candidate split points has more chances to produce a split that reduces node impurity by chance alone, so it is selected more often and accumulates more purity gain. The permutation-based %IncMSE measure is less affected: only x1 is positive (about 1.08) and the other three are negative, so none of them looks genuinely useful there. The conditional inference forest at the bottom gives all four variables an importance within about 0.02 of zero, which is consistent with none of them being important and shows that its split selection reduces the bias.
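The selection mechanism behind this bias can be shown without any forest at all. The base R sketch below repeatedly generates a noise outcome with the same four predictor types, finds the best single split on each predictor by exhaustive search, and records which predictor wins. The function `best_split_gain` and the settings `n = 100` and `reps = 200` are choices made for this illustration; it mimics a CART-style search for one root split and is not the internal code of randomForest.

```r
set.seed(123)

# Largest reduction in squared error achievable by one binary split on x
best_split_gain <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]          # splitting at the maximum leaves one side empty
  total_ss <- sum((y - mean(y))^2)
  gains <- vapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    total_ss - sum((left - mean(left))^2) - sum((right - mean(right))^2)
  }, numeric(1))
  max(gains)
}

n <- 100
reps <- 200
winner <- character(reps)

for (i in seq_len(reps)) {
  y <- rnorm(n)                                  # pure noise outcome
  preds <- list(
    x1 = rnorm(n),                               # continuous
    x2 = sample(1:20, n, replace = TRUE),        # 20 levels
    x3 = sample(1:5, n, replace = TRUE),         # 5 levels
    x4 = sample(0:1, n, replace = TRUE)          # binary
  )
  gains <- vapply(preds, best_split_gain, numeric(1), y = y)
  winner[i] <- names(which.max(gains))
}

# Share of repetitions in which each predictor offered the best root split
wins <- table(factor(winner, levels = c("x1", "x2", "x3", "x4")))
wins / reps
```

If the search were unbiased, each noise predictor would win about a quarter of the time. Instead the continuous predictor wins most often, simply because it offers the most candidate cut points.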
(a) Why does the right model focus importance on fewer predictors while the left spreads importance?
The model on the right concentrates importance on a few predictors because it uses a high learning rate and a large bagging fraction. With a high learning rate, each tree contributes a large share of the fit, so the first few trees, which split on the strongest predictors, account for most of the error reduction. With a large bagging fraction, every tree is built on nearly the same data, so successive trees keep choosing the same predictors. The model on the left uses a small learning rate and a small bagging fraction: each tree makes only a small correction and sees a different subsample, so more trees and more predictors get a chance to contribute, and importance is spread across a larger number of predictors.
(b) Which model is more predictive of other samples?
The model on the left would likely be more predictive for new samples. Its lower learning rate and smaller bagging fraction act as regularization, helping prevent overfitting. By spreading importance across more predictors, it relies less on a small set of variables that may capture noise, making it more robust when applied to new data.
(c) How does increasing interaction depth affect predictor importance?
Increasing interaction depth allows the model to capture more complex relationships between predictors. As a result, importance tends to become more concentrated on variables involved in strong interactions, making the importance curve steeper. This effect would be stronger in the right model, where importance is already concentrated, while the left model would still show a more balanced distribution due to its more gradual learning.
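These answers can be probed empirically with gbm's `shrinkage` (learning rate) and `bag.fraction` arguments. The sketch below fits two models to simulated Friedman 1 data and compares how much of the total relative influence the top three predictors hold. The data, the helper `fit_gbm`, the paired values 0.1 and 0.9, and the other tuning values are assumptions made for illustration, not the settings behind the figure in the exercise. Changing `interaction.depth` in the helper gives a direct check on part (c).

```r
library(gbm)

set.seed(123)
n <- 500
d <- as.data.frame(matrix(runif(n * 10), ncol = 10))   # columns V1..V10
d$y <- 10 * sin(pi * d$V1 * d$V2) + 20 * (d$V3 - 0.5)^2 +
  10 * d$V4 + 5 * d$V5 + rnorm(n)

# Fit a boosted model with the learning rate and bagging fraction set to the same value
fit_gbm <- function(value) {
  gbm(y ~ ., data = d, distribution = "gaussian",
      n.trees = 200, interaction.depth = 1, n.minobsinnode = 5,
      shrinkage = value, bag.fraction = value)
}

# summary() returns predictors sorted by relative influence (normalised to sum to 100)
inf_low  <- summary(fit_gbm(0.1), plotit = FALSE)
inf_high <- summary(fit_gbm(0.9), plotit = FALSE)

# Share of total influence held by the three top-ranked predictors in each model
top3_share <- c(low  = sum(head(inf_low$rel.inf, 3)),
                high = sum(head(inf_high$rel.inf, 3)))
round(top3_share, 1)
```

A larger top-three share means a steeper importance profile. Repeating the comparison across several seeds gives a better picture than a single run.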
(a) Best Model Performance
# Libraries
library(caret)
library(gbm)
library(randomForest)
library(rpart)
library(rpart.plot)
library(AppliedPredictiveModeling)
library(dplyr)
set.seed(123)
# Load data
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
# Split FIRST
trainIndex <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
# Impute missing values using median
preProc <- preProcess(train, method = "medianImpute")
train_imputed <- predict(preProc, train)
test_imputed <- predict(preProc, test)
# Cross-validation
ctrl <- trainControl(method = "cv", number = 5)
# Models
tree_model <- train(Yield ~ ., data = train_imputed,
                    method = "rpart",
                    trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rf_model <- train(Yield ~ ., data = train_imputed,
                  method = "rf",
                  trControl = ctrl,
                  tuneLength = 5)
gbm_model <- train(Yield ~ ., data = train_imputed,
                   method = "gbm",
                   trControl = ctrl,
                   verbose = FALSE,
                   tuneLength = 5)
# Compare
resamples(list(Tree = tree_model,
               RF = rf_model,
               GBM = gbm_model)) %>% summary()
##
## Call:
## summary.resamples(object = .)
##
## Models: Tree, RF, GBM
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 1.0742174 1.0877440 1.0970787 1.1179499 1.1426969 1.1880124 0
## RF 0.6531250 0.7868625 0.8220182 0.8927160 1.0618623 1.1397121 0
## GBM 0.7251532 0.7420665 0.8214573 0.8370968 0.9165505 0.9802565 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 1.3112961 1.3149071 1.315060 1.411250 1.483606 1.631383 0
## RF 0.9832113 0.9886813 1.043563 1.182022 1.317870 1.576785 0
## GBM 0.9115073 1.0433393 1.080267 1.070631 1.103301 1.214740 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3676634 0.3680714 0.4732691 0.4341356 0.4755086 0.4861655 0
## RF 0.4132577 0.5796483 0.6327096 0.6052876 0.6917760 0.7090465 0
## GBM 0.6101213 0.6430015 0.7007906 0.6810526 0.7212323 0.7301173 0
# Test performance
pred_tree <- predict(tree_model, test_imputed)
pred_rf <- predict(rf_model, test_imputed)
pred_gbm <- predict(gbm_model, test_imputed)
postResample(pred_tree, test_imputed$Yield)
## RMSE Rsquared MAE
## 1.7003910 0.2248231 1.3158401
postResample(pred_rf, test_imputed$Yield)
## RMSE Rsquared MAE
## 1.3225839 0.4840139 0.9976617
postResample(pred_gbm, test_imputed$Yield)
## RMSE Rsquared MAE
## 1.1050271 0.6464494 0.8851463
Question:
Which model gives the best resampling and test performance?
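The gradient boosting model (GBM) performs best among the tree-based models. In resampling it has the lowest mean RMSE (1.07, against 1.18 for the random forest and 1.41 for the single tree), the lowest mean MAE (0.84, against 0.89 and 1.12), and the highest mean R² (0.68, against 0.61 and 0.43). The ranking is the same on the test set, where GBM reaches an RMSE of 1.11 and an R² of 0.65, compared with 1.32 and 0.48 for the random forest and 1.70 and 0.22 for the single tree. The random forest improves on a single tree by averaging many trees to reduce variance, and boosting improves further here by fitting each new tree to the errors left by the previous ones. With only five folds and a small test set these differences carry some uncertainty, but GBM is the most predictive model in this comparison.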
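For reference, the three statistics that `postResample()` reports can be written out in base R. The sketch below assumes the usual definitions, with R² taken as the squared correlation between predictions and observations. The function name `regression_metrics` and the four yield values are made up for illustration and are not taken from the data set.

```r
# RMSE, R-squared (squared correlation) and MAE for a set of predictions
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,
    MAE      = mean(abs(pred - obs)))
}

# Hypothetical yields, used only to exercise the function
obs  <- c(38, 40, 41, 43)
pred <- c(39, 40, 42, 43)

metrics <- regression_metrics(pred, obs)
round(metrics, 3)
```

Because RMSE squares the errors before averaging, it penalises large misses more heavily than MAE does, which is why the two can rank models differently on individual folds.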
(b) Variable Importance
# Variable importance for GBM
gbm_imp <- varImp(gbm_model)
# Print importance
print(gbm_imp)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 21.923
## BiologicalMaterial12 18.711
## ManufacturingProcess31 18.097
## ManufacturingProcess09 15.608
## ManufacturingProcess06 14.515
## BiologicalMaterial03 13.097
## ManufacturingProcess01 9.785
## BiologicalMaterial11 8.634
## ManufacturingProcess04 8.632
## ManufacturingProcess27 7.929
## BiologicalMaterial09 7.670
## ManufacturingProcess13 7.533
## ManufacturingProcess25 6.995
## BiologicalMaterial02 6.558
## ManufacturingProcess37 6.371
## BiologicalMaterial10 6.199
## ManufacturingProcess30 6.111
## ManufacturingProcess43 5.830
## BiologicalMaterial01 5.392
# Plot top 10 variables
plot(gbm_imp, top = 10)
Questions:
Which predictors are most important?
The most important predictor in the gradient boosting model is ManufacturingProcess32, with a scaled importance of 100 against about 21.9 for the runner-up, ManufacturingProcess17. The next most important predictors are BiologicalMaterial12, ManufacturingProcess31, ManufacturingProcess09, and ManufacturingProcess06, followed by BiologicalMaterial03, ManufacturingProcess01, BiologicalMaterial11, and ManufacturingProcess04. Overall, the model relies on one dominant predictor and a small group of secondary ones, with importance falling below 10 from the eighth variable onward.
Do biological or process variables dominate?
Process variables clearly dominate the list of important predictors. Seven of the top 10 variables are manufacturing process variables, including the top-ranked ManufacturingProcess32, while only three are biological (BiologicalMaterial12, BiologicalMaterial03, and BiologicalMaterial11). This suggests that the operating conditions of the manufacturing process have a stronger impact on yield than the biological inputs, at least as far as this model can detect.
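The process-versus-biological split can be counted directly from the printed ranking. The snippet below is a small base R sketch; the ten names are copied in order from the `varImp()` output above, and the objects `top10`, `group`, and `counts` are introduced here for illustration.

```r
# Top 10 predictors, in order, copied from the printed GBM importance table
top10 <- c("ManufacturingProcess32", "ManufacturingProcess17",
           "BiologicalMaterial12",   "ManufacturingProcess31",
           "ManufacturingProcess09", "ManufacturingProcess06",
           "BiologicalMaterial03",   "ManufacturingProcess01",
           "BiologicalMaterial11",   "ManufacturingProcess04")

# Classify each predictor by its name prefix and count the two groups
group  <- ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological")
counts <- table(group)
counts
```

The same two lines can be applied to the top-10 list from any other model to compare the mix of variable types.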
How do top 10 predictors compare to linear/nonlinear models?
This section does not reproduce the top-10 lists from the earlier linear and nonlinear models, so the comparison cannot be completed from the output shown here. The points to check against those lists are whether ManufacturingProcess32 also ranks first, how many of the ten GBM predictors above reappear, and whether process variables outnumber biological ones to the same degree.
(c) Optimal Tree Visualization
rpart.plot(tree_model$finalModel,
           type = 2,
           extra = 101,
           fallen.leaves = TRUE,
           cex = 0.8)
Question:
Does this visualization provide insight into predictor relationships
with yield?
Yes, this visualization provides some insight into how predictors relate to yield. ManufacturingProcess32 forms the first split, which agrees with its top rank in the GBM importance table, and values greater than or equal to 160 lead to the highest yield (around 42). When ManufacturingProcess32 is lower, yield decreases and the tree splits further on ManufacturingProcess17, so the effect of ManufacturingProcess17 is only modeled within the low ManufacturingProcess32 group. There, yield is lowest (around 39) when ManufacturingProcess17 is below 34 and slightly higher (around 40) otherwise. The tree therefore gives interpretable threshold rules and reinforces that process conditions play a dominant role. Its insight is limited, however: the single tree had a test R² of only about 0.22, so these rules explain a small share of the variation in yield.