Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
No. The random forest should place V1–V5 at the top of the importance ranking, while V6–V10 should have very low (often near-zero) importance. Any small nonzero importance for V6–V10 is usually just noise from the randomness of the forest, not meaningful signal.
library(party)
set.seed(123)
cf <- cforest(
y ~ .,
data = simulated,
controls = cforest_unbiased(ntree = 1000, mtry = floor(sqrt(ncol(simulated) - 1)))
)
imp_traditional <- varimp(cf, conditional = FALSE)
imp_conditional <- varimp(cf, conditional = TRUE)
sort(imp_traditional, decreasing = TRUE)## V4 V2 V1 duplicate1 V5 V3
## 5.88307548 5.04185805 4.87666042 4.30497985 1.75398229 0.10914836
## V7 V9 V6 V10 V8
## 0.01427750 0.00270447 -0.01226969 -0.02344100 -0.09078758
## V4 V2 duplicate1 V1 V5 V3
## 4.377312180 3.797465721 2.103651333 1.946961810 1.100275857 0.093029881
## V6 V9 V7 V8 V10
## -0.002757279 -0.003823935 -0.009743235 -0.026716395 -0.039213577
Yes. The cforest importances show the same main pattern as the traditional random forest: V1–V5 are most important and V6–V10 contribute little. The conditional importance just adjusts for correlation, so correlated predictors share less inflated importance.
Yes. With boosted trees and Cubist, the same general pattern usually appears: the informative predictors (V1–V5) rank highest and the uninformative predictors (V6–V10) stay near the bottom. As with random forests, adding predictors that are highly correlated with an informative variable can cause importance to be shared or the ranking to shift within that correlated group, but the noise predictors still do not become truly important.
set.seed(123)
n <- 500
X_cont <- rnorm(n)
X_coarse <- cut(X_cont, breaks = quantile(X_cont, probs = seq(0, 1, length.out = 6)),
include.lowest = TRUE, labels = FALSE)
noise_cont <- replicate(5, rnorm(n))
noise_coarse <- replicate(5, sample(1:5, n, replace = TRUE))
colnames(noise_cont) <- paste0("Ncont", 1:5)
colnames(noise_coarse) <- paste0("Ncoarse", 1:5)
y <- 2*X_cont + 2*as.numeric(scale(X_coarse)) + rnorm(n, sd = 1)
dat <- data.frame(
y = y,
X_cont = X_cont,
X_coarse = X_coarse,
noise_cont,
noise_coarse
)library(randomForest)
rf <- randomForest(y ~ ., data = dat, ntree = 1000, importance = TRUE)
imp_rf <- importance(rf, type = 1) # %IncMSE
imp_rf <- sort(imp_rf[,1], decreasing = TRUE)
imp_rf## X_cont X_coarse Ncont5 Ncoarse5 Ncoarse2 Ncoarse3 Ncont3
## 51.7753899 39.2573131 1.5551348 1.4271762 -0.6363005 -0.8959828 -1.0815222
## Ncont4 Ncont1 Ncoarse4 Ncont2 Ncoarse1
## -1.4467764 -1.4737580 -2.0577915 -2.2700067 -3.1315046
library(gbm)
gbm_fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
n.trees = 2000, interaction.depth = 3,
shrinkage = 0.01, bag.fraction = 0.7,
train.fraction = 1.0, verbose = FALSE)
imp_gbm <- summary(gbm_fit, plotit = FALSE)
imp_gbm[order(-imp_gbm$rel.inf), ]## var rel.inf
## X_cont X_cont 90.9829559
## X_coarse X_coarse 4.3567738
## Ncont1 Ncont1 0.7965436
## Ncont4 Ncont4 0.7852258
## Ncont3 Ncont3 0.7721657
## Ncont5 Ncont5 0.7385458
## Ncont2 Ncont2 0.5984015
## Ncoarse3 Ncoarse3 0.2446533
## Ncoarse4 Ncoarse4 0.2292532
## Ncoarse5 Ncoarse5 0.1887183
## Ncoarse2 Ncoarse2 0.1569205
## Ncoarse1 Ncoarse1 0.1498427
barplot(setNames(imp_gbm$rel.inf, imp_gbm$var)[order(-imp_gbm$rel.inf)],
las = 2, main = "GBM Relative Influence")The model on the right concentrates importance on just a few predictors because using a high learning rate and a high bagging fraction makes boosting learn very aggressively from almost the full dataset each iteration, so it repeatedly splits on the strongest predictors and quickly explains most of the signal with them. In contrast, the model on the left has a low learning rate and low bagging fraction, which slows down learning and adds more randomness from using smaller subsamples, giving other predictors more opportunities to contribute across many trees. As a result, importance is spread across a wider set of variables instead of being dominated by only the top few.
The model on the left is likely to be more predictive of other samples because the smaller learning rate and bagging fraction lead to a more gradual, stable fitting process that reduces overfitting. By spreading importance across more predictors and learning more slowly, this model is better able to generalize to new data than the more aggressive model on the right, which is more likely to overfit to the strongest predictors in the training sample.
Increasing the interaction depth would tend to flatten the slope of predictor importance for either model. With deeper interactions, trees can capture more complex relationships involving combinations of predictors rather than relying heavily on just the strongest main effects. As a result, importance is spread across more variables, reducing the dominance of the top predictors and making the importance distribution less steep compared to models with shallow interaction depth.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
dat <- ChemicalManufacturingProcess## [1] "Yield" "BiologicalMaterial01" "BiologicalMaterial02"
## [4] "BiologicalMaterial03" "BiologicalMaterial04" "BiologicalMaterial05"
## [7] "BiologicalMaterial06" "BiologicalMaterial07" "BiologicalMaterial08"
## [10] "BiologicalMaterial09"
idx <- createDataPartition(dat[[y_name]], p = 0.80, list = FALSE)
train_dat <- dat[idx, ]
test_dat <- dat[-idx, ]ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
pp_rf <- c("medianImpute") # RF doesn't need scaling
pp_gbm <- c("medianImpute") # GBM usually fine without scaling
pp_cub <- c("medianImpute", "center", "scale")set.seed(123)
fit_rf <- train(
x = train_dat[, x_names],
y = train_dat[[y_name]],
method = "rf",
trControl = ctrl,
preProcess = pp_rf,
tuneLength = 10,
metric = "RMSE"
)fit_gbm <- train(
x = train_dat[, x_names],
y = train_dat[[y_name]],
method = "gbm",
trControl = ctrl,
preProcess = pp_gbm,
tuneLength = 15,
metric = "RMSE",
verbose = FALSE
)fit_cubist <- train(
x = train_dat[, x_names],
y = train_dat[[y_name]],
method = "cubist",
trControl = ctrl,
preProcess = pp_cub,
tuneLength = 15,
metric = "RMSE"
)##
## Call:
## summary.resamples(object = res)
##
## Models: RF, GBM, Cubist
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4676725 0.7022000 0.8805879 0.8680449 1.0006205 1.488103 0
## GBM 0.3827541 0.6750214 0.8033993 0.8404696 1.0061369 1.469875 0
## Cubist 0.4743240 0.6511300 0.7450497 0.7792669 0.9328867 1.264084 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5065772 0.8844017 1.094949 1.107242 1.316293 2.010253 0
## GBM 0.5691844 0.8995812 1.057580 1.095640 1.286162 1.746316 0
## Cubist 0.6206644 0.8507692 0.959520 1.002959 1.137632 1.786479 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.2950537 0.5739084 0.6717680 0.6716376 0.7798828 0.9201389 0
## GBM 0.2945332 0.6136996 0.6976376 0.6701092 0.7578439 0.8947421 0
## Cubist 0.4040896 0.6568010 0.7226099 0.7126169 0.7700812 0.9047800 0
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
test_truth <- test_dat[[y_name]]
test_perf <- data.frame(
Model = c("RF", "GBM", "Cubist"),
Test_RMSE = c(
rmse(test_truth, predict(fit_rf, test_dat)),
rmse(test_truth, predict(fit_gbm, test_dat)),
rmse(test_truth, predict(fit_cubist, test_dat))
)
)
test_perf[order(test_perf$Test_RMSE), ]## Model Test_RMSE
## 3 Cubist 1.040890
## 2 GBM 1.146741
## 1 RF 1.287363
The “optimal” tree-based regression model is the one with the best resampling performance (usually the lowest cross-validated RMSE) and that also has the lowest RMSE on the test set. In most runs for this chemical manufacturing data, an ensemble tree method such as Random Forest or Gradient Boosting (GBM) tends to perform best because it captures nonlinear effects and interactions, but your final answer should name whichever model in your results has the lowest CV RMSE and test RMSE.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
library(rpart)
library(rpart.plot)
set.seed(123)
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling") # only if you have it
dat <- ChemicalManufacturingProcess
set.seed(123)
idx <- createDataPartition(dat$Yield, p = 0.80, list = FALSE)
train_dat <- dat[idx, ]
tree_fit <- rpart(Yield ~ ., data = train_dat, method = "anova")
par(mar = c(1, 1, 1, 1)) # bottom, left, top, right
plot(tree_fit)
text(tree_fit, use.n = TRUE)# Terminal-node membership for each training observation
leaf_id <- tree_fit$where
# Boxplots of Yield by terminal node (distribution in each leaf)
boxplot(
train_dat$Yield ~ as.factor(leaf_id),
xlab = "Terminal node",
ylab = "Yield",
main = "Yield distribution within terminal nodes"
)
This view of the data provides limited additional biological or process
insight. While the tree highlights which predictors are most influential
and identifies threshold values associated with changes in yield, these
relationships are data-driven and do not directly reflect underlying
biological mechanisms. The results are therefore more useful for
prediction and variable screening than for gaining mechanistic
understanding of how the predictors influence yield.