8.1

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##          Overall
## V1   8.605365900
## V2   6.831259165
## V3   0.741534943
## V4   7.883384091
## V5   2.244750293
## V6   0.136054182
## V7   0.055950944
## V8  -0.068195812
## V9   0.003196175
## V10 -0.054705900

V6 (0.136), V7 (0.056), and V9 (0.003) have negligible importance, consistent with our expectations for noise variables, while V8 (-0.068) and V10 (-0.055) even come out slightly negative. In short, the random forest effectively ignores the uninformative predictors V6–V10: their scores are all essentially zero, and the small negative values simply reflect the sampling variability of the permutation-based importance rather than any real effect.
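As a quick, minimal check (using an arbitrary cutoff of 0.5 on the permutation importance), we can list the predictors whose scores are essentially zero:

# Sketch: predictors whose permutation importance is essentially zero
# (cutoff of 0.5 chosen arbitrarily for illustration)
subset(rfImp1, abs(Overall) < 0.5)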

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

simulated2 <- simulated
simulated2$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9476651
set.seed(200)
model2 <- randomForest(
  y ~ ., 
  data = simulated2,
  importance = TRUE,
  ntree = 1000
)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                Overall
## V1          5.26426006
## V2          6.07175433
## V3          0.51040749
## V4          6.80567764
## V5          2.18127427
## V6          0.29434454
## V7         -0.01490498
## V8         -0.09461681
## V9          0.01769244
## V10        -0.07631531
## duplicate1  4.28678654

The variable importance scores from the second random forest show that adding a highly correlated predictor (duplicate1) diluted the importance of the original variable: V1's score dropped from 8.61 to 5.26, while duplicate1 absorbed a substantial share of it (4.29). This illustrates how random forests split importance among correlated features. The combined score of V1 and duplicate1 (about 9.55) is of the same order as, and in fact slightly exceeds, V1's original score, so the signal carried by V1 is now shared between the two correlated columns rather than lost.

Additionally, the other informative predictors, such as V2 and V4, retained high importance, while the uninformative variables (V6–V10) continued to show almost no impact. This behavior highlights a key limitation of the traditional random forest importance measure: correlated predictors split the credit and can obscure how important a feature truly is.
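A small sketch makes the redistribution explicit by comparing V1's original score with the combined score of V1 and duplicate1 in the second model:

# Sketch: how the original V1 importance compares with the split importance
c(original_V1 = rfImp1["V1", "Overall"],
  V1_plus_dup = rfImp2["V1", "Overall"] + rfImp2["duplicate1", "Overall"])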

Do these importances show the same pattern as the traditional random forest model?

library(party)
set.seed(200)
cforest_model <- cforest(
  y ~ ., 
  data = simulated,
  controls = cforest_unbiased(ntree = 1000)
)
cforest_imp <- varimp(cforest_model, conditional = TRUE)
print(cforest_imp)
##           V1           V2           V3           V4           V5           V6 
##  5.475491657  5.158601036  0.032690391  6.537084249  1.210913336  0.004670879 
##           V7           V8           V9          V10 
##  0.002755602 -0.009000983 -0.017799337 -0.006333743

The comparison between the traditional random forest and the conditional inference forest (cforest) shows the same overall ranking but noticeably smaller magnitudes. For example, V1 scores 8.61 under the traditional permutation importance versus 5.48 under cforest's conditional importance, a reduction of roughly 36% (though the two scales are not strictly comparable). The conditional scheme is deliberately more conservative: each predictor is permuted within strata defined by the predictors associated with it, so information shared with correlated covariates is not double-counted.

Both models identify V4, V1, and V2 as the key predictors and assign near-zero importance to the noise variables (V6–V10). The conditional importance, however, shrinks the weaker predictors (V3, V5) and the noise variables even closer to zero, giving a cleaner separation between informative and uninformative predictors while still clearly detecting the truly informative variables.
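For reference, the unconditional (standard) permutation importance from the same cforest fit can be obtained with conditional = FALSE; comparing the two makes the shrinkage introduced by the conditional scheme visible. A minimal sketch:

# Sketch: unconditional permutation importance from the same cforest fit,
# for comparison with the conditional values above
set.seed(200)
cforest_imp_uncond <- varimp(cforest_model, conditional = FALSE)
sort(cforest_imp_uncond, decreasing = TRUE)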

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
library(caret)

set.seed(200)
gbm_model <- gbm(
  y ~ .,
  data = simulated,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 3,
  shrinkage = 0.1,
  cv.folds = 5,
  verbose = FALSE
)

gbm_imp <- summary(gbm_model, plotit = FALSE)
print(gbm_imp)
##     var   rel.inf
## V4   V4 28.172651
## V1   V1 22.109061
## V2   V2 21.967069
## V5   V5 11.388556
## V3   V3  8.008256
## V7   V7  1.999360
## V8   V8  1.838171
## V6   V6  1.781569
## V9   V9  1.602491
## V10 V10  1.132818
library(Cubist)
cubist_model <- cubist(
  x = simulated[, -ncol(simulated)],
  y = simulated$y,
  committees = 50
)

cubist_imp <- varImp(cubist_model)
print(cubist_imp)
##     Overall
## V1     71.0
## V3     44.5
## V2     57.5
## V4     48.5
## V5     35.0
## V6     15.5
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

The GBM and Cubist models reinforce the patterns seen with the random forests. Both identify V1, V2, and V4 as the most influential predictors and give little weight to the noise variables (V6–V10), but they behave differently in detail: GBM spreads relative influence more gradually (e.g., V4 = 28.2, V1 = 22.1) and still assigns the noise variables small nonzero values, whereas Cubist's rule-based model is more aggressive, dropping V7–V10 to exactly zero and putting V1 far in front with a score of 71.0. Note that these fits use the original simulated data, so predictor correlation is not tested directly here; in general, boosting tends to concentrate influence on one member of a correlated pair, and Cubist's rules often use a single representative. The sketch below refits both models with the correlated duplicate1 column included to check this.
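A minimal sketch of that check, assuming the simulated2 data frame (with duplicate1) from earlier is still in the workspace:

# Sketch: refit GBM and Cubist with the correlated copy of V1 included,
# to see how each model splits importance between V1 and duplicate1
set.seed(200)
gbm_model2 <- gbm(y ~ ., data = simulated2, distribution = "gaussian",
                  n.trees = 1000, interaction.depth = 3, shrinkage = 0.1,
                  verbose = FALSE)
summary(gbm_model2, plotit = FALSE)

cubist_model2 <- cubist(x = simulated2[, setdiff(names(simulated2), "y")],
                        y = simulated2$y, committees = 50)
varImp(cubist_model2)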

8.2

Use a simulation to show tree bias with different granularities.

Simulate data

library(rpart)
library(patchwork)
library(tidyverse)

set.seed(1048)

# Simulate data with a true linear relationship
n <- 1000
x <- runif(n, 0, 10)
y <- 2*x + rnorm(n, 0, 2)
data <- data.frame(x, y)

Fit trees with different depths (granularities)

tree_depth1 <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = 1))
tree_depth2 <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = 2))
tree_depth3 <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = 3))
tree_depth5 <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = 5))
tree_depth10 <- rpart(y ~ x, data = data, control = rpart.control(maxdepth = 10))

Generate predictions

plot_data <- data.frame(x = seq(0, 10, length.out = 200))
plot_data$true <- 2 * plot_data$x
plot_data$depth1 <- predict(tree_depth1, newdata = plot_data)
plot_data$depth2 <- predict(tree_depth2, newdata = plot_data)
plot_data$depth3 <- predict(tree_depth3, newdata = plot_data)
plot_data$depth5 <- predict(tree_depth5, newdata = plot_data)
plot_data$depth10 <- predict(tree_depth10, newdata = plot_data)

Create plots

p1 <- ggplot(plot_data) +
  geom_point(data = data, aes(x, y), alpha = 0.3) +
  geom_line(aes(x, true), color = "red", linewidth = 1) +
  geom_step(aes(x, depth1), color = "blue", linewidth = 1) +
  ggtitle("Tree Depth = 1") +
  theme_minimal()

p2 <- ggplot(plot_data) +
  geom_point(data = data, aes(x, y), alpha = 0.3) +
  geom_line(aes(x, true), color = "red", linewidth = 1) +
  geom_step(aes(x, depth2), color = "blue", linewidth = 1) +
  ggtitle("Tree Depth = 2") +
  theme_minimal()

p3 <- ggplot(plot_data) +
  geom_point(data = data, aes(x, y), alpha = 0.3) +
  geom_line(aes(x, true), color = "red", linewidth = 1) +
  geom_step(aes(x, depth3), color = "blue", linewidth = 1) +
  ggtitle("Tree Depth = 3") +
  theme_minimal()

p4 <- ggplot(plot_data) +
  geom_point(data = data, aes(x, y), alpha = 0.3) +
  geom_line(aes(x, true), color = "red", linewidth = 1) +
  geom_step(aes(x, depth5), color = "blue", linewidth = 1) +
  ggtitle("Tree Depth = 5") +
  theme_minimal()

p5 <- ggplot(plot_data) +
  geom_point(data = data, aes(x, y), alpha = 0.3) +
  geom_line(aes(x, true), color = "red", linewidth = 1) +
  geom_step(aes(x, depth10), color = "blue", linewidth = 1) +
  ggtitle("Tree Depth = 10") +
  theme_minimal()

# Combine plots
combined_plot <- wrap_plots(
  p1, p2, p3, p4, p5,
  ncol = 2,
  nrow = 3,
  byrow = TRUE
) + 
  plot_annotation(
    title = "Tree Bias at Different Granularities (Depth Levels)",
    subtitle = "Red line = True linear relationship, Blue line = Tree approximation"
  )

print(combined_plot)
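To quantify what the panels show, a short sketch computes the root-mean-square gap between each tree's piecewise-constant fit and the true mean function 2x over the prediction grid; the gap should shrink as depth (granularity) increases:

# Sketch: approximation error of each tree relative to the true mean function
sapply(c("depth1", "depth2", "depth3", "depth5", "depth10"), function(d) {
  sqrt(mean((plot_data[[d]] - plot_data$true)^2))
})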

8.3

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right, which is characterized by a high bagging fraction of 0.9 and a learning rate of 0.9, places significant emphasis on only a select few predictors. This is primarily a result of the elevated learning rate, which allows each tree to make substantial updates to the model, rapidly highlighting the dominant predictors. Additionally, the high bagging fraction means that each tree accesses the majority of the data, thereby diminishing diversity and leading to the repeated selection of the same predictors.

In contrast, the model on the left, featuring a low bagging fraction of 0.1 and a learning rate of 0.1, distributes importance more evenly across predictors. The lower learning rate requires numerous incremental adjustments, allowing weaker predictors to gradually contribute over time. Moreover, the low bagging fraction means each tree processes only 10% of the data, which enhances diversity and increases the likelihood of a broader range of predictors being selected.
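These two settings can be imitated on a small scale. The sketch below uses the simulated data from 8.1 as a stand-in for the solubility data behind Fig. 8.24, with a reduced n.minobsinnode so the 10% bag still allows splits, so the two relative-influence profiles can be compared:

# Sketch: contrast importance concentration under the two settings discussed above
set.seed(200)
gbm_left  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                 n.minobsinnode = 5, interaction.depth = 1, verbose = FALSE)
gbm_right <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                 interaction.depth = 1, verbose = FALSE)
summary(gbm_left,  plotit = FALSE)  # expected: influence spread over more predictors
summary(gbm_right, plotit = FALSE)  # expected: influence concentrated on a few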

Which model do you think would be more predictive of other samples?

The model on the left, with the smaller learning rate and bagging fraction, is likely to generalize better for the following reasons: a slower learning rate helps prevent overfitting; the greater diversity of predictors used makes the model more robust; and the right-hand model may rely too heavily on a few predictors, increasing the risk of overfitting.

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth (allowing deeper trees with more interactions) would make the slopes of the variable importance profiles steeper for both models. Deeper trees can capture more complex relationships, which tends to elevate the prominence of a few key predictors, and the effect would be more pronounced in the model with the larger learning rate and bagging fraction, since it already concentrates importance on a limited number of predictors.
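The effect of interaction depth can be probed with the same stand-in data, as a sketch rather than a reproduction of Fig. 8.24: refit the GBM at interaction.depth = 1 and 10 and compare the relative-influence profiles.

# Sketch: relative influence at shallow versus deep interaction depth
set.seed(200)
imp_d1  <- summary(gbm(y ~ ., data = simulated, distribution = "gaussian",
                       n.trees = 100, shrinkage = 0.1, interaction.depth = 1,
                       verbose = FALSE), plotit = FALSE)
set.seed(200)
imp_d10 <- summary(gbm(y ~ ., data = simulated, distribution = "gaussian",
                       n.trees = 100, shrinkage = 0.1, interaction.depth = 10,
                       verbose = FALSE), plotit = FALSE)
head(imp_d1)
head(imp_d10)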

8.7

# Load libraries
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(gbm)

# Load and impute data (same as 6.3 and 7.5)
data(ChemicalManufacturingProcess)
processed <- preProcess(ChemicalManufacturingProcess, 
                       method = c("knnImpute", "center", "scale", "nzv"))
imputedData <- predict(processed, ChemicalManufacturingProcess)

# Split data (80% train, 20% test)
set.seed(1225)
trainIndex <- createDataPartition(imputedData$Yield, p = 0.8, list = FALSE)
trainData <- imputedData[trainIndex, ]
testData <- imputedData[-trainIndex, ]
# Set up repeated cross-validation
ctrl <- trainControl(method = "repeatedcv", 
                    number = 10, 
                    repeats = 5)

# (a) Train several tree-based models to find optimal one
set.seed(1225)
models <- list()

# Single CART model
models$cart <- train(Yield ~ ., 
                     data = trainData,
                     method = "rpart",
                     tuneLength = 10,
                     trControl = ctrl)

# Bagged CART
models$bagging <- train(Yield ~ ., 
                        data = trainData,
                        method = "treebag",
                        trControl = ctrl)

# Random Forest
models$rf <- train(Yield ~ ., 
                   data = trainData,
                   method = "rf",
                   tuneLength = 5,
                   importance = TRUE,
                   trControl = ctrl)

# Gradient Boosting Machine (GBM)
models$gbm <- train(Yield ~ ., 
                    data = trainData,
                    method = "gbm",
                    tuneLength = 5,
                    verbose = FALSE,
                    trControl = ctrl)

# Compare model performance
results <- resamples(models)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: cart, bagging, rf, gbm 
## Number of resamples: 50 
## 
## MAE 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart    0.3738558 0.5051677 0.5824203 0.5942638 0.6854522 0.9307899    0
## bagging 0.3275683 0.4310399 0.4953922 0.5059690 0.5635101 0.8398362    0
## rf      0.2351597 0.4012545 0.4788849 0.4715485 0.5467272 0.6837675    0
## gbm     0.2129303 0.3894843 0.4569945 0.4697483 0.5617412 0.7006704    0
## 
## RMSE 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart    0.4528563 0.6195245 0.7703338 0.7642517 0.8858285 1.2323528    0
## bagging 0.4067777 0.5594586 0.6458706 0.6504269 0.7008623 1.1211873    0
## rf      0.2942644 0.5038615 0.5996890 0.5955457 0.6738828 0.8724546    0
## gbm     0.3095891 0.4930365 0.5690963 0.5823388 0.6873379 0.8313220    0
## 
## Rsquared 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart    0.03161651 0.3165296 0.4421560 0.4353097 0.5861159 0.7832887    0
## bagging 0.17267027 0.4841941 0.5410597 0.5644877 0.6856403 0.8671349    0
## rf      0.26108907 0.5817642 0.6809144 0.6618984 0.7791197 0.9332155    0
## gbm     0.22076309 0.5191066 0.6724145 0.6411409 0.7618929 0.8754893    0
# Evaluate on test set
testResults <- data.frame(
  CART = postResample(predict(models$cart, testData), testData$Yield),
  Bagging = postResample(predict(models$bagging, testData), testData$Yield),
  RF = postResample(predict(models$rf, testData), testData$Yield),
  GBM = postResample(predict(models$gbm, testData), testData$Yield)
)

print(testResults)
##               CART   Bagging        RF       GBM
## RMSE     0.8497570 0.7231121 0.7135570 0.6305881
## Rsquared 0.4265936 0.6134135 0.6405333 0.6822244
## MAE      0.6631797 0.5446428 0.5363279 0.4544352

Which tree-based regression model gives the optimal resampling and test set performance?

Random forest and GBM are clearly the two strongest models, and the numbers are closer than a single winner suggests. In the resampling results GBM has the lowest MAE (median 0.457 vs. 0.479 for RF) and the lowest RMSE (median 0.569 vs. 0.600), while RF has the highest R-squared (median 0.681 vs. 0.672). On the held-out test set GBM is best on all three metrics (RMSE 0.631, R-squared 0.682, MAE 0.454), with RF a close second; bagging trails both, and CART is the weakest model. Either of the two leaders could reasonably be called optimal; the importance analysis below carries forward the random forest, which had the highest resampled R-squared, while noting that GBM's test-set performance is slightly better.
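Because the two leading models are close, a paired comparison of the resampled metrics (a minimal sketch using caret's diff method for resamples objects) helps check whether the gap exceeds resampling noise:

# Sketch: pairwise paired-t comparisons of the resampled MAE, RMSE and Rsquared
summary(diff(results))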

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

# Get variable importance from the optimal model (RF)
var_imp <- varImp(models$rf, scale = FALSE)
top_predictors <- var_imp$importance %>% 
  arrange(desc(Overall)) %>% 
  head(10)
var_imp
## rf variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  21.538
## ManufacturingProcess09  10.607
## BiologicalMaterial06    10.037
## BiologicalMaterial12     9.263
## BiologicalMaterial03     9.228
## ManufacturingProcess36   8.034
## ManufacturingProcess13   7.808
## ManufacturingProcess11   7.076
## ManufacturingProcess33   7.065
## BiologicalMaterial04     6.966
## BiologicalMaterial02     6.679
## BiologicalMaterial11     6.424
## ManufacturingProcess31   5.999
## ManufacturingProcess17   5.923
## ManufacturingProcess39   5.824
## BiologicalMaterial08     5.726
## ManufacturingProcess28   5.656
## ManufacturingProcess34   5.395
## BiologicalMaterial01     5.288
## BiologicalMaterial05     4.850
# Plot
ggplot(top_predictors, aes(x = reorder(rownames(top_predictors), Overall), y = Overall)) +
  geom_col() +
  coord_flip() +
  labs(x = "Predictor", y = "Importance", title = "Top 10 Predictors in RF Model")

The top 10 predictors in the random forest model are a mix of process and biological variables, with process variables in the majority (6 of 10). The contrast with the optimal linear and nonlinear models from the earlier exercises is largely one of emphasis: linear models assume global, additive effects, which tends to favor the biologically meaningful measurements, whereas tree-based models such as the random forest capture interactions and thresholds, which makes the manufacturing process variables, whose effects are often nonlinear, relatively more influential.
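A short sketch tallies process versus biological variables among the random forest's top 10 (recomputing the top 10 with base R so row names are retained):

# Sketch: count process vs. biological predictors in the RF top 10
imp_df <- var_imp$importance
top10  <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))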

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

# Partial dependence for top predictor
library(pdp)
## 
## 载入程序包:'pdp'
## The following object is masked from 'package:purrr':
## 
##     partial
partial(models$rf, pred.var = rownames(top_predictors)[1], train = trainData) %>% 
  autoplot() + 
  labs(title = "Partial Dependence: ManufacturingProcess32")

The partial dependence plot reveals a nonlinear relationship between the top process variable and yield: ManufacturingProcess32 appears to show diminishing returns beyond a certain value. A random forest has no single constituent tree to plot, so partial dependence is used here instead; it indicates that the key process variables have nonlinear, threshold-like effects on yield, knowledge that is not easily captured by linear models.
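To answer the question as literally as possible, here is a sketch of one way to plot a single optimal tree: refit CART at the cp value caret selected (keeping the model frame), convert it with partykit, and plot it; for a numeric response the terminal nodes then show boxplots of yield.

# Sketch: single CART tree at the tuned cp, with yield distributions in the leaves
library(partykit)
best_cp <- models$cart$bestTune$cp
single_tree <- rpart(Yield ~ ., data = trainData,
                     control = rpart.control(cp = best_cp), model = TRUE)
plot(as.party(single_tree),
     main = "Optimal single CART tree: yield distribution by terminal node")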