Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                        importance = TRUE,
                        ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6–V10)?

No. The random forest should place V1–V5 at the top of the importance ranking, while V6–V10 should have very low (often near-zero) importance. Any small nonzero importance for V6–V10 is usually just noise from the randomness of the forest, not meaningful signal.
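
For reference, the scores computed above can be printed in decreasing order. This is a minimal check using the rfImp1 object already created; no new quantities are introduced.

# Permutation importance from model1, sorted from most to least important
rfImp1[order(-rfImp1$Overall), , drop = FALSE]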

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

library(randomForest)
library(caret)

set.seed(123)

simulated$duplicate1 <- simulated$V1 + rnorm(nrow(simulated)) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9504983
model2 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                 Overall
## V1          5.171852887
## V2          6.180268049
## V3          0.557357350
## V4          7.208879483
## V5          1.857522008
## V6          0.298116908
## V7          0.055392827
## V8         -0.067892749
## V9         -0.083381598
## V10         0.004140843
## duplicate1  4.132078053

Yes. After adding duplicate1 (highly correlated with V1), the random forest assigns a high importance to duplicate1 (4.13) while V1 remains important (5.17), showing that the model now splits predictive “credit” between the two correlated predictors. This means V1’s importance is effectively reduced compared to when it was the only strong proxy for that information. If you add more predictors that are also highly correlated with V1, the same effect continues—importance gets spread across the correlated group, so each individual variable looks less important even though the underlying signal is the same.
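
To check the last claim directly, a second noisy copy of V1 can be added and the forest refit. This is a sketch: sim2, duplicate2, and model3 are names introduced here for illustration, and working on a copy keeps the later chunks that use simulated unchanged.

# Work on a copy so later chunks that use `simulated` are unaffected
sim2 <- simulated
set.seed(123)
sim2$duplicate2 <- sim2$V1 + rnorm(nrow(sim2)) * 0.1

model3 <- randomForest(y ~ ., data = sim2, importance = TRUE, ntree = 1000)

# Expect V1's importance to drop further, since credit is now split three ways
varImp(model3, scale = FALSE)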

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)

set.seed(123)

cf <- cforest(
  y ~ .,
  data = simulated,
  controls = cforest_unbiased(ntree = 1000, mtry = floor(sqrt(ncol(simulated) - 1)))
)

imp_traditional <- varimp(cf, conditional = FALSE)
imp_conditional <- varimp(cf, conditional = TRUE)

sort(imp_traditional, decreasing = TRUE)
##          V4          V2          V1  duplicate1          V5          V3 
##  5.88307548  5.04185805  4.87666042  4.30497985  1.75398229  0.10914836 
##          V7          V9          V6         V10          V8 
##  0.01427750  0.00270447 -0.01226969 -0.02344100 -0.09078758
sort(imp_conditional, decreasing = TRUE)
##           V4           V2   duplicate1           V1           V5           V3 
##  4.377312180  3.797465721  2.103651333  1.946961810  1.100275857  0.093029881 
##           V6           V9           V7           V8          V10 
## -0.002757279 -0.003823935 -0.009743235 -0.026716395 -0.039213577

Yes. The cforest importances show the same broad pattern as the traditional random forest: V1–V5 (plus duplicate1) sit at the top while V6–V10 contribute essentially nothing. The conditional importance adjusts for correlation among predictors, so the scores for the correlated pair V1 and duplicate1 shrink relative to the unconditional measure, but the separation between informative and uninformative predictors is unchanged.
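
For a direct side-by-side comparison of the two measures, the vectors computed above can be bound into one table:

# Unconditional vs. conditional importance, rounded for readability
round(cbind(traditional = imp_traditional,
            conditional = imp_conditional), 3)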

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Yes. With boosted trees and Cubist, the same general pattern usually appears: the informative predictors (V1–V5) rank highest and the uninformative predictors (V6–V10) stay near the bottom. As with random forests, adding predictors that are highly correlated with an informative variable can cause importance to be shared or the ranking to shift within that correlated group, but the noise predictors still do not become truly important.
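
As a quick check, both model types can be fit to the same simulated data (including duplicate1) and their importance measures inspected. This is a sketch: gbm_sim and cubist_sim are names introduced here, and the Cubist package is assumed to be installed.

library(gbm)
library(Cubist)

set.seed(123)
gbm_sim <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 3,
               shrinkage = 0.01, verbose = FALSE)
summary(gbm_sim, plotit = FALSE)      # relative influence per predictor

set.seed(123)
cubist_sim <- cubist(x = simulated[, setdiff(names(simulated), "y")],
                     y = simulated$y, committees = 25)
varImp(cubist_sim)                    # usage in rule conditions and models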

8.2 Use a simulation to show tree bias with different granularities.

set.seed(123)
n <- 500


# High-granularity predictor: continuous, with ~n distinct values
X_cont <- rnorm(n)

# Low-granularity predictor: the same variable coarsened into 5 ordered bins
X_coarse <- cut(X_cont, breaks = quantile(X_cont, probs = seq(0, 1, length.out = 6)),
                include.lowest = TRUE, labels = FALSE)

# Uninformative predictors at both granularities
noise_cont   <- replicate(5, rnorm(n))
noise_coarse <- replicate(5, sample(1:5, n, replace = TRUE))

colnames(noise_cont) <- paste0("Ncont", 1:5)
colnames(noise_coarse) <- paste0("Ncoarse", 1:5)

# Both informative predictors enter the response with equal coefficients
y <- 2*X_cont + 2*as.numeric(scale(X_coarse)) + rnorm(n, sd = 1)

dat <- data.frame(
  y = y,
  X_cont = X_cont,
  X_coarse = X_coarse,
  noise_cont,
  noise_coarse
)
library(randomForest)
rf <- randomForest(y ~ ., data = dat, ntree = 1000, importance = TRUE)

imp_rf <- importance(rf, type = 1)  # %IncMSE
imp_rf <- sort(imp_rf[,1], decreasing = TRUE)
imp_rf
##     X_cont   X_coarse     Ncont5   Ncoarse5   Ncoarse2   Ncoarse3     Ncont3 
## 51.7753899 39.2573131  1.5551348  1.4271762 -0.6363005 -0.8959828 -1.0815222 
##     Ncont4     Ncont1   Ncoarse4     Ncont2   Ncoarse1 
## -1.4467764 -1.4737580 -2.0577915 -2.2700067 -3.1315046
barplot(imp_rf, las = 2, main = "RF Variable Importance (%IncMSE)")

library(gbm)
gbm_fit <- gbm(y ~ ., data = dat, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3,
               shrinkage = 0.01, bag.fraction = 0.7,
               train.fraction = 1.0, verbose = FALSE)

imp_gbm <- summary(gbm_fit, plotit = FALSE)
imp_gbm[order(-imp_gbm$rel.inf), ]
##               var    rel.inf
## X_cont     X_cont 90.9829559
## X_coarse X_coarse  4.3567738
## Ncont1     Ncont1  0.7965436
## Ncont4     Ncont4  0.7852258
## Ncont3     Ncont3  0.7721657
## Ncont5     Ncont5  0.7385458
## Ncont2     Ncont2  0.5984015
## Ncoarse3 Ncoarse3  0.2446533
## Ncoarse4 Ncoarse4  0.2292532
## Ncoarse5 Ncoarse5  0.1887183
## Ncoarse2 Ncoarse2  0.1569205
## Ncoarse1 Ncoarse1  0.1498427
barplot(setNames(imp_gbm$rel.inf, imp_gbm$var)[order(-imp_gbm$rel.inf)],
        las = 2, main = "GBM Relative Influence")
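
Both models attribute more importance to X_cont than to X_coarse even though the two carry the same underlying signal (RF: 51.8 vs. 39.3 %IncMSE; GBM: 91% vs. 4% relative influence), and in the boosted model the continuous noise predictors also outrank the coarse noise predictors. This illustrates the bias of exhaustive-search splits toward predictors with more distinct values. The bias is even easier to see with single trees fit to pure noise; below is a minimal sketch (n_sim, sim, stump, and first_split are names introduced here for illustration) that records which predictor wins the root split when neither is informative.

library(rpart)

set.seed(123)
n_sim <- 200
first_split <- character(n_sim)

for (i in seq_len(n_sim)) {
  # Pure-noise response: neither predictor is truly informative
  sim <- data.frame(
    y        = rnorm(100),
    x_fine   = rnorm(100),                      # ~100 distinct values
    x_coarse = sample(1:2, 100, replace = TRUE) # only 2 distinct values
  )
  stump <- rpart(y ~ ., data = sim,
                 control = rpart.control(maxdepth = 1, cp = -1))
  first_split[i] <- as.character(stump$frame$var[1])
}

# An exhaustive-search tree picks the high-granularity predictor far more often
table(first_split)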

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right concentrates importance on just a few predictors because using a high learning rate and a high bagging fraction makes boosting learn very aggressively from almost the full dataset each iteration, so it repeatedly splits on the strongest predictors and quickly explains most of the signal with them. In contrast, the model on the left has a low learning rate and low bagging fraction, which slows down learning and adds more randomness from using smaller subsamples, giving other predictors more opportunities to contribute across many trees. As a result, importance is spread across a wider set of variables instead of being dominated by only the top few.
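
To see this empirically, boosted models can be refit at the two extremes from Fig. 8.24. This is a sketch assuming the solubility objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package; sol_train, gbm_low, and gbm_high are names introduced here for illustration.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)   # provides solTrainXtrans and solTrainY

sol_train <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(123)
gbm_low <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)

set.seed(123)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

# Relative influence should be far more concentrated in the top few
# predictors for the aggressive (0.9 / 0.9) settings
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)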

(b) Which model do you think would be more predictive of other samples?

The model on the left is likely to be more predictive of other samples because the smaller learning rate and bagging fraction lead to a more gradual, stable fitting process that reduces overfitting. By spreading importance across more predictors and learning more slowly, this model is better able to generalize to new data than the more aggressive model on the right, which is more likely to overfit to the strongest predictors in the training sample.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would tend to flatten the slope of predictor importance for either model. With deeper interactions, trees can capture more complex relationships involving combinations of predictors rather than relying heavily on just the strongest main effects. As a result, importance is spread across more variables, reducing the dominance of the top predictors and making the importance distribution less steep compared to models with shallow interaction depth.
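
Continuing the sketch above with the same sol_train data frame (gbm_deep is another illustrative name), increasing interaction.depth should make the relative-influence profile fall off more gradually:

set.seed(123)
gbm_deep <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.1, bag.fraction = 0.5, verbose = FALSE)

# Deeper trees involve more predictors per tree, so the importance
# profile should have a flatter slope than with interaction.depth = 1
head(summary(gbm_deep, plotit = FALSE), 20)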

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

dat <- ChemicalManufacturingProcess
names(dat)[1:10]
##  [1] "Yield"                "BiologicalMaterial01" "BiologicalMaterial02"
##  [4] "BiologicalMaterial03" "BiologicalMaterial04" "BiologicalMaterial05"
##  [7] "BiologicalMaterial06" "BiologicalMaterial07" "BiologicalMaterial08"
## [10] "BiologicalMaterial09"
library(caret)
set.seed(123)

y_name <- "Yield" 

x_names <- setdiff(names(dat), y_name)
idx <- createDataPartition(dat[[y_name]], p = 0.80, list = FALSE)
train_dat <- dat[idx, ]
test_dat  <- dat[-idx, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

pp_rf   <- c("medianImpute")                 # RF doesn't need scaling
pp_gbm  <- c("medianImpute")                 # GBM usually fine without scaling
pp_cub  <- c("medianImpute", "center", "scale")
set.seed(123)
fit_rf <- train(
  x = train_dat[, x_names],
  y = train_dat[[y_name]],
  method = "rf",
  trControl = ctrl,
  preProcess = pp_rf,
  tuneLength = 10,
  metric = "RMSE"
)
fit_gbm <- train(
  x = train_dat[, x_names],
  y = train_dat[[y_name]],
  method = "gbm",
  trControl = ctrl,
  preProcess = pp_gbm,
  tuneLength = 15,
  metric = "RMSE",
  verbose = FALSE
)
fit_cubist <- train(
  x = train_dat[, x_names],
  y = train_dat[[y_name]],
  method = "cubist",
  trControl = ctrl,
  preProcess = pp_cub,
  tuneLength = 15,
  metric = "RMSE"
)
res <- resamples(list(RF = fit_rf, GBM = fit_gbm, Cubist = fit_cubist))
summary(res)
## 
## Call:
## summary.resamples(object = res)
## 
## Models: RF, GBM, Cubist 
## Number of resamples: 50 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RF     0.4676725 0.7022000 0.8805879 0.8680449 1.0006205 1.488103    0
## GBM    0.3827541 0.6750214 0.8033993 0.8404696 1.0061369 1.469875    0
## Cubist 0.4743240 0.6511300 0.7450497 0.7792669 0.9328867 1.264084    0
## 
## RMSE 
##             Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF     0.5065772 0.8844017 1.094949 1.107242 1.316293 2.010253    0
## GBM    0.5691844 0.8995812 1.057580 1.095640 1.286162 1.746316    0
## Cubist 0.6206644 0.8507692 0.959520 1.002959 1.137632 1.786479    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF     0.2950537 0.5739084 0.6717680 0.6716376 0.7798828 0.9201389    0
## GBM    0.2945332 0.6136996 0.6976376 0.6701092 0.7578439 0.8947421    0
## Cubist 0.4040896 0.6568010 0.7226099 0.7126169 0.7700812 0.9047800    0
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

test_truth <- test_dat[[y_name]]
test_perf <- data.frame(
  Model = c("RF", "GBM", "Cubist"),
  Test_RMSE = c(
    rmse(test_truth, predict(fit_rf, test_dat)),
    rmse(test_truth, predict(fit_gbm, test_dat)),
    rmse(test_truth, predict(fit_cubist, test_dat))
  )
)

test_perf[order(test_perf$Test_RMSE), ]
##    Model Test_RMSE
## 3 Cubist  1.040890
## 2    GBM  1.146741
## 1     RF  1.287363

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Based on the results above, Cubist gives the optimal performance on both criteria: it has the lowest cross-validated RMSE (mean ≈ 1.00, versus ≈ 1.10 for GBM and ≈ 1.11 for random forest) and the highest resampled R² (≈ 0.71), and it also has the lowest test set RMSE (1.04, versus 1.15 for GBM and 1.29 for random forest). Among the tree-based models fit here, Cubist is therefore the optimal model, with GBM second and random forest third.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

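Variable importance for the optimal model can be pulled directly from the fitted train object. Below is a sketch assuming fit_cubist from above is the optimal model; imp_cubist and top10 are names introduced here for illustration.

# Importance scores for the optimal tree-based model (Cubist here)
imp_cubist <- varImp(fit_cubist, scale = FALSE)
plot(imp_cubist, top = 10)

top10 <- rownames(imp_cubist$importance)[order(-imp_cubist$importance$Overall)][1:10]
top10

# How many of the top 10 are process vs. biological predictors?
table(ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological"))

In typical runs on these data the manufacturing process predictors dominate the top of the list, with only a few biological materials mixed in, and the leading process variables overlap heavily with the top 10 from the optimal linear (Exercise 6.3) and nonlinear (Exercise 7.5) models. All three model classes therefore point to the process measurements as the main levers on yield, which is encouraging because the process steps, unlike the biological starting materials, can potentially be adjusted.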

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
library(rpart.plot)

set.seed(123)

# Rebuild the same 80/20 training split so this chunk is self-contained
data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
dat <- ChemicalManufacturingProcess

idx <- createDataPartition(dat$Yield, p = 0.80, list = FALSE)
train_dat <- dat[idx, ]

# Fit a single regression tree; rpart handles the missing values via surrogate splits
tree_fit <- rpart(Yield ~ ., data = train_dat, method = "anova")

# Base-graphics view of the tree
par(mar = c(1, 1, 1, 1))  # bottom, left, top, right margins
plot(tree_fit)
text(tree_fit, use.n = TRUE)

# Nicer rendering of the same tree (mean yield plus n and % of observations per node)
rpart.plot(tree_fit, type = 2, extra = 101, fallen.leaves = TRUE)

# Terminal-node membership for each training observation
leaf_id <- tree_fit$where

# Boxplots of Yield by terminal node (distribution in each leaf)
boxplot(
  train_dat$Yield ~ as.factor(leaf_id),
  xlab = "Terminal node",
  ylab = "Yield",
  main = "Yield distribution within terminal nodes"
)
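
A quick numeric summary of the leaves, reusing the leaf_id and tree_fit objects above, makes the boxplot easier to read:

# Mean yield within each terminal node
round(tapply(train_dat$Yield, leaf_id, mean), 2)

# Predictors that actually appear as split variables in the tree
setdiff(as.character(unique(tree_fit$frame$var)), "<leaf>")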

This view adds some, but limited, additional knowledge. The single tree shows which predictors define the high- and low-yield subgroups and the threshold values at which the splits occur, which is useful for screening variables and for suggesting settings worth investigating further. However, these splits are data-driven associations rather than mechanisms, so the plot does not by itself explain how the biological or process predictors cause changes in yield; it is more useful for prediction and variable screening than for mechanistic understanding.