library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.2
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(
y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The random forest model relied almost entirely on V1 through V5, which have much higher importance scores. V6 through V10 have scores near zero or slightly negative, indicating the model did not meaningfully use them when making splits, so the uninformative predictors played no significant role.
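A quick visual check of this gap is to plot the importance scores from the fitted forest; a minimal sketch using randomForest's varImpPlot():
# Dot charts of permutation importance (%IncMSE) and node-purity
# importance; V1-V5 should sit well above V6-V10 in both panels
varImpPlot(model1)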
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.2
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.2
# Keep only the original predictors V1-V10 and the response y
simulated <- simulated[, c(1:11)]
model4 <- cforest(
y ~ .,
data = simulated
)
varimp(model4, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 5.471457361 5.166826657 0.020994281 6.689072245 1.256076719 0.004925215
## V7 V8 V9 V10
## -0.008184439 -0.009141850 -0.012594617 -0.013200299
The results from the cforest model in the party package show a similar overall pattern to the traditional random forest model. V1, V2, and V4 have the highest importance scores and V5 is moderate, while V6 through V10 are near zero or slightly negative, so both approaches identify the same truly informative variables (here the conditional measure also pushes V3 close to zero). The main difference is that the conditional importance described in Strobl et al. (2007) adjusts for biases such as correlation among predictors, which makes cforest a more reliable option when predictors are related.
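To see how much the conditional adjustment matters here, the traditional (unconditional) importance can be computed from the same fitted cforest object and compared with the conditional values above; a minimal sketch:
# Unconditional permutation importance from the same cforest model,
# for comparison with the conditional values reported above
varimp(model4, conditional = FALSE)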
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
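The trainControl() call below sets allowParallel = TRUE, and a cluster object is shut down with stopCluster() at the end of the script, so a parallel backend has to be registered first; a minimal sketch assuming the doParallel package is used for this:
library(doParallel)
# Create a cluster (leaving one core free) and register it so caret can
# run resampling iterations in parallel; this is the `cluster` object
# stopped at the end of the script
cluster <- makeCluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cluster)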
global_tr_control <- trainControl(method = "cv", allowParallel = TRUE)
boost_tree_grid <- expand.grid(
.interaction.depth = seq(1, 7, by = 1),
.n.trees = seq(100, 1000, by = 50),
.shrinkage = seq(.01, .5, by = .05),
.n.minobsinnode = c(5:15)
)
boost_fit <- train(
simulated[, c(1:10)],
simulated$y,
method = "gbm",
tuneGrid = boost_tree_grid,
verbose = FALSE,
trControl = global_tr_control
)
varImp(boost_fit, scale = FALSE)
## gbm variable importance
##
## Overall
## V4 9420.3
## V1 8484.2
## V2 7416.5
## V5 3714.4
## V3 3088.3
## V7 548.1
## V6 534.3
## V8 434.6
## V9 382.7
## V10 381.0
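The exact tuning combination that caret selected from this grid can be checked before reading too much into the importance scores; a minimal sketch:
# Cross-validation winner: interaction.depth, n.trees, shrinkage,
# and n.minobsinnode used for the final gbm fit
boost_fit$bestTune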
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
cubist_fit <- cubist(
simulated[, c(1:10)],
simulated$y
)
varImp(cubist_fit, scale = FALSE)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
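The usage values above (50 for V1, V2, V4, and V5; 0 for the rest) come from how often each predictor appears in the rule conditions and in the per-rule linear models; the fitted rules can be inspected directly, a minimal sketch:
# Print the Cubist rules and their linear models, including the
# attribute-usage table that underlies the importance scores above
summary(cubist_fit)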
library(rpart)
library(partykit)
## Warning: package 'partykit' was built under R version 4.4.3
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.4.3
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
set.seed(19940211)
# Simulate four predictors with the same range but increasing granularity:
# c1 has 10 possible values, c2 has 100, c3 has 1,000, and c4 has 10,000
c1 <- sample(1:10 / 10, 2000, replace = TRUE)
c2 <- sample(1:100 / 100, 2000, replace = TRUE)
c3 <- sample(1:1000 / 1000, 2000, replace = TRUE)
c4 <- sample(1:10000 / 10000, 2000, replace = TRUE)
# Each predictor contributes equally to the response
y <- c1 + c2 + c3 + c4 + rnorm(2000, mean = 0, sd = 1.5)
train_tree_data <- data.frame(c1, c2, c3, c4, y)
sim_tree_fit <- rpart(
y ~ .,
data = train_tree_data
)
plot(as.party(sim_tree_fit))
varImp(sim_tree_fit, scale = FALSE)
## Overall
## c1 0.06361753
## c2 0.17198751
## c3 0.11809556
## c4 0.13957752
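The simulation is set up so each predictor contributes equally to y and differs only in how many distinct values it offers the tree's split search; a minimal sketch to confirm the granularities:
# Number of unique values actually drawn for each predictor
sapply(train_tree_data[, c("c1", "c2", "c3", "c4")], function(x) length(unique(x)))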
The right model concentrates importance on just the first few predictors because it learns more aggressively: with a high bagging fraction each tree sees nearly all of the data, and with a high learning rate each tree's contribution is large, so the same strong predictors are selected again and again. The left model, with a low bagging fraction and small shrinkage, learns in smaller steps on more varied subsamples, which spreads importance across a broader set of predictors.
The left model (with lower bagging fraction and shrinkage values) is likely to be more predictive on other samples. This is because it distributes importance across a broader set of predictors, which suggests it’s capturing more patterns in the data and is less likely to overfit. In contrast, the right model focuses too heavily on just a few variables, which increases the risk of overfitting to the training data and performing poorly on new, unseen data.
So, the left model may generalize better and be more reliable for predicting other samples.
Increasing the interaction depth would likely make the slope of predictor importance even steeper for both models. This means the top predictors would stand out even more, while the less important ones would fade further into the background. With deeper interactions, the model is allowed to capture more complex relationships between variables, which can cause it to lean even harder on the strongest predictors. As a result, the gap between the most and least important features would grow, especially in a model like the one on the right that’s already focused on just a few key variables.
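These effects can be explored directly by refitting boosted trees at the two extremes on the simulated Friedman data from above; a minimal sketch, assuming the two panels in the text correspond to bagging fraction and shrinkage values of 0.1 and 0.9:
# "Left" settings: small learning rate and bagging fraction
# (n.minobsinnode kept small so trees can still be grown from 10% bags)
gbm_left <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3,
shrinkage = 0.1, bag.fraction = 0.1, n.minobsinnode = 5
)
# "Right" settings: large learning rate and bagging fraction
gbm_right <- gbm(
y ~ ., data = simulated, distribution = "gaussian",
n.trees = 1000, interaction.depth = 3,
shrinkage = 0.9, bag.fraction = 0.9, n.minobsinnode = 5
)
# Relative influence: the "right" fit should concentrate influence on
# fewer predictors; re-running with a larger interaction.depth shows how
# tree depth changes that spread
summary(gbm_left, plotit = FALSE)
summary(gbm_right, plotit = FALSE)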
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
data(ChemicalManufacturingProcess)
sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
## [1] 106
# Removing entries with low variance
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]
# Filling NAs using KNN
knn_impute <- preProcess(
ChemicalManufacturingProcess[, -c(1)],
method = "knnImpute"
)
cmp_independent <- predict(
knn_impute,
ChemicalManufacturingProcess[, -c(1)]
)
cmp_dependent <- ChemicalManufacturingProcess[, c(1), drop = FALSE]
sum(is.na(cmp_independent[, -c(1)]))
## [1] 0
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
cmp_train_rows <- createDataPartition(
cmp_independent$BiologicalMaterial01,
p = 0.8,
list = FALSE
)
# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent[cmp_train_rows, ]
# Use the sample to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent[-cmp_train_rows, ]
cmp_stree_fit <- train(
cmp_train_ind,
cmp_train_dep,
method = "rpart2",
tuneLength = 10,
trControl = global_tr_control
)
cmp_stree_pred <- predict(
cmp_stree_fit,
cmp_test_ind
)
postResample(
cmp_stree_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.4827649 0.5008663 1.1742636
library(ipred)
cmp_bag_fit <- ipredbagg(
cmp_train_dep,
cmp_train_ind
)
cmp_bag_pred <- predict(
cmp_bag_fit,
cmp_test_ind
)
postResample(
cmp_bag_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.2790524 0.6830667 0.9635714
library(randomForest)
cmp_randforest_fit <- randomForest(
cmp_train_ind,
cmp_train_dep,
importance = TRUE,
ntree = 1000
)
cmp_randforest_pred <- predict(
cmp_randforest_fit,
cmp_test_ind
)
postResample(
cmp_randforest_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3094566 0.6734821 0.9844426
boosted_animals <- train(
cmp_train_ind,
cmp_train_dep,
method = "gbm",
tuneGrid = boost_tree_grid,
verbose = FALSE,
trControl = global_tr_control
)
boosted_pred <- predict(
boosted_animals,
cmp_test_ind
)
postResample(
boosted_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## 1.3748034 0.5648721 1.0478236
cmp_cubist_fit <- cubist(
cmp_train_ind,
cmp_train_dep
)
# Note: these predictions are made on the training predictors, while
# postResample() below compares them against the test outcomes, so the
# result is not a valid test-set evaluation
cmp_cubist_pred <- predict(
cmp_cubist_fit,
cmp_train_ind
)
postResample(
cmp_cubist_pred,
cmp_test_dep
)
## RMSE Rsquared MAE
## NA 0.004342594 NA
The bagged tree model gave the best overall performance: the lowest RMSE (1.28), the lowest MAE (0.96), and the highest R² (0.68). Random forest came close, boosted trees did reasonably well, and the single tree trailed all three. The Cubist numbers should not be compared against these, because its predictions were generated on the training set and then passed to postResample() along with the test outcomes; the NA and near-zero values reflect that mismatch rather than Cubist's predictive ability.
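For a side-by-side view, the test-set metrics for the models that were actually scored on the test predictors can be stacked into one table; a minimal sketch using the prediction objects above (Cubist is omitted for the reason just given):
# Test-set performance of the four validly evaluated models
rbind(
SingleTree   = postResample(cmp_stree_pred, cmp_test_dep),
BaggedTree   = postResample(cmp_bag_pred, cmp_test_dep),
RandomForest = postResample(cmp_randforest_pred, cmp_test_dep),
Boosted      = postResample(boosted_pred, cmp_test_dep)
)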
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:party':
##
## where
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
varImp(cmp_bag_fit) |>
arrange(desc(Overall)) |>
head(10)
## Overall
## BiologicalMaterial03 0.9034354
## ManufacturingProcess32 0.8108736
## ManufacturingProcess17 0.7102317
## BiologicalMaterial11 0.7071977
## BiologicalMaterial12 0.7003613
## ManufacturingProcess31 0.6280937
## ManufacturingProcess09 0.6146938
## ManufacturingProcess06 0.5380744
## BiologicalMaterial05 0.5254365
## BiologicalMaterial09 0.4977530
The most important predictors in the optimal tree-based model (the bagged tree) are a mix of biological and manufacturing process variables: the top 10 splits evenly, five of each type, with BiologicalMaterial03 ranked first and BiologicalMaterial11 and BiologicalMaterial12 close behind ManufacturingProcess32 and ManufacturingProcess17. Compared with the top predictors from the best linear and nonlinear models, there is likely some overlap in which variables dominate, but the tree-based model can also capture interactions and threshold effects that those models miss.
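The biological/process split in the top 10 can be counted directly from the importance table; a minimal sketch:
# Count Biological vs. Manufacturing predictors among the ten most
# important variables of the bagged tree model
bag_imp <- varImp(cmp_bag_fit)
top10 <- rownames(bag_imp)[order(bag_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))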
library(rpart.plot)
rpart.plot(
cmp_stree_fit$finalModel,
type = 4
)
Yes, this tree plot provides helpful insight into how specific biological and process variables relate to yield. The first split on ManufacturingProcess32 shows that it plays a major role in predicting the outcome, and further down the tree, variables such as ManufacturingProcess11, BiologicalMaterial12, and BiologicalMaterial01 appear in key splits, reinforcing their importance. The values in the terminal nodes give a clear sense of how different combinations of these variables influence the predicted yield, so the tree structure offers an interpretable, step-by-step view of how particular predictor thresholds impact yield.
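The same splits can also be read as text: printing the final rpart object lists, for every node, the split condition, the number of training samples it covers, and the node's predicted yield; a minimal sketch:
# Text view of the tuned single tree behind the plot above
print(cmp_stree_fit$finalModel)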
stopCluster(cluster)
registerDoSEQ()