Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE,  ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Answer: The random forest model did not significantly use the uninformative predictors (V6–V10). They appear in the variable importance output, but their contribution was minimal compared to the informative predictors (V1–V5): V1, V2, and V4 had the highest importance scores, while V6–V10 had scores close to zero (and some slightly negative), indicating that permuting them barely changes the model's accuracy.
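
As a quick numerical check (a small sketch, not part of the original output, using the rfImp1 object computed above), the average importance of the two groups can be contrasted directly:

# Compare the mean permutation importance of the informative (V1-V5) and
# uninformative (V6-V10) predictors from the fitted random forest.
c(
  informative   = mean(rfImp1[paste0("V", 1:5),  "Overall"]),
  uninformative = mean(rfImp1[paste0("V", 6:10), "Overall"])
)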

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

This new predictor, duplicate1, is highly correlated with V1 (correlation = 0.95). We then refit the random forest model with this new variable included.

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

print(rfImp2)
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

Answer: After adding duplicate1, a predictor highly correlated with V1 (correlation of about 0.95), the importance score of V1 decreased markedly, from about 8.73 to 5.69. This happens because the forest now has two predictors carrying nearly the same information, so it splits on them interchangeably and the permutation importance is shared between them (duplicate1 itself receives a score of about 4.28). If another predictor highly correlated with V1 were added, V1's importance would likely decrease even further, with its contribution distributed across all of the similar variables. This illustrates how random forests can dilute importance scores when predictors are highly collinear.
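
To check that last claim, here is a minimal sketch (an addition to the exercise output; the seed is arbitrary) that adds a second noisy copy of V1 on a copy of the data, so later code in this write-up is unaffected:

# Add a second predictor highly correlated with V1 and refit the random forest.
set.seed(201)
sim_extra <- simulated
sim_extra$duplicate2 <- sim_extra$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = sim_extra, importance = TRUE, ntree = 1000)
# V1's score is expected to shrink further, now shared with duplicate1 and duplicate2.
varImp(model3, scale = FALSE)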

(c) Use the ‘cforest’ function in the party package to fit a random forest model using conditional inference trees. The party package function ‘varimp’ can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
set.seed(123)

cf_model <- cforest(y ~ ., data = simulated[, 1:11])  # columns 1:11 are V1-V10 and y; duplicate1 is excluded

imp_traditional <- varimp(cf_model, conditional = FALSE) # Traditional
imp_conditional <- varimp(cf_model, conditional = TRUE) # Conditional (Strobl et al.)
comparison <- cbind(
  Traditional = round(imp_traditional, 3),
  Conditional = round(imp_conditional, 3)
)
print(comparison)
##     Traditional Conditional
## V1        8.854       5.586
## V2        6.549       5.337
## V3        0.024       0.045
## V4        8.222       6.565
## V5        2.096       1.183
## V6        0.006       0.004
## V7        0.023       0.027
## V8       -0.047      -0.009
## V9        0.008       0.008
## V10      -0.024      -0.030
print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Answer: I used the ‘cforest()’ function from the party package to fit a random forest model using conditional inference trees. The ‘varimp()’ function was then used to calculate variable importance in two ways:

  • Traditional importance (conditional = FALSE)

  • Conditional importance (conditional = TRUE), which adjusts for correlation as described by Strobl et al. (2007).

Both approaches identified V1, V2, and V4 as the most important predictors, the same pattern as the traditional random forest model from part (a). The conditional importance scores were lower across the board, particularly for V1 and V4, because the conditional permutation scheme measures each variable's contribution after accounting for its association with the other predictors, which generally shrinks the raw scores. This makes the conditional method a less biased assessment of each variable's unique contribution when predictors are correlated.

In summary, while both methods highlight similar top variables, the conditional approach gives a more accurate picture in the presence of multicollinearity.
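
As a compact way to see this (a sketch assuming rfImp1, imp_traditional, and imp_conditional are still in the workspace), the rankings from the three importance measures can be lined up by predictor:

# Rank the predictors under each importance measure (1 = most important),
# matching by predictor name so the rows line up.
pred_names <- rownames(rfImp1)
data.frame(
  RF_traditional      = rank(-rfImp1$Overall),
  cforest_traditional = rank(-imp_traditional[pred_names]),
  cforest_conditional = rank(-imp_conditional[pred_names]),
  row.names = pred_names
)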

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
library(Cubist)

1. Boosted Trees (GBM):

X <- simulated[, 1:10]  # predictors V1 to V10
y <- simulated$y        # outcome
set.seed(456)
gbm_grid <- expand.grid(
  .interaction.depth = c(1, 3, 5),
  .n.trees = c(100, 300, 500),
  .shrinkage = c(0.01, 0.1),
  .n.minobsinnode = 10  # default
)

gbm_model <- train(
  x = X, y = y,
  method = "gbm",
  tuneGrid = gbm_grid,
  trControl = trainControl(method = "cv"),
  verbose = FALSE
)

varImp(gbm_model)
## gbm variable importance
## 
##     Overall
## V1  100.000
## V4   91.234
## V2   73.568
## V5   42.478
## V3   22.918
## V6    4.791
## V7    4.055
## V8    1.858
## V9    1.555
## V10   0.000

2. Cubist Model:

set.seed(456)
cubist_model <- train(
  x = X, y = y,
  method = "cubist",
  trControl = trainControl(method = "cv")
)

varImp(cubist_model)
## cubist variable importance
## 
##     Overall
## V1   100.00
## V2    77.30
## V4    70.92
## V5    67.38
## V3    58.16
## V6    19.86
## V9     0.00
## V8     0.00
## V10    0.00
## V7     0.00

Answer: I repeated the analysis using boosted trees (GBM) and Cubist models. In both models, the most important predictors were V1, V2, V4, and V5, which matches the results from the previous random forest and conditional inference models.

In the GBM model, uninformative variables like V6–V10 had much lower importance, with V10 receiving 0 importance.

In the Cubist model, V6 had some minor importance, while V7–V10 had zero importance, further confirming that the model focused on the relevant predictors.

Overall, both GBM and Cubist showed the same general pattern: they correctly identified the informative variables (V1–V5) and downplayed or ignored the uninformative ones. This supports the consistency of variable importance across different tree-based models.
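
As a compact cross-check (a sketch that assumes the model objects fit above are still available), each model's top-5 predictors can be collected side by side:

# Extract the five highest-ranked predictors from each fitted model.
top5 <- function(imp) rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:5]
list(
  RandomForest = top5(rfImp1),
  cforest      = names(sort(imp_conditional, decreasing = TRUE))[1:5],
  GBM          = top5(varImp(gbm_model)$importance),
  Cubist       = top5(varImp(cubist_model)$importance)
)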

In conclusion, this assignment teaches us to interpret variable importance carefully, especially when variables are correlated, and to compare models to ensure reliable results.


Exercise 8.2

2. Use a simulation to show tree bias with different granularities.

set.seed(789)
x1 <- sample(seq(0.1, 1.0, by = 0.1), 500, replace = TRUE)      # 10 unique values
x2 <- sample(seq(0.01, 1.0, by = 0.01), 500, replace = TRUE)    # 100 values
x3 <- sample(seq(0.001, 1.0, by = 0.001), 500, replace = TRUE)  # 1000 values
x4 <- runif(500, 0, 1)                                          # continuous (high granularity)
x5 <- runif(500, 0, 1)                                          # another continuous

# Create outcome as a combination of all
y <- x1 + x2 + x3 + x4 + x5
sim_df <- data.frame(x1, x2, x3, x4, x5, y)
library(rpart)
tree_model <- rpart(y ~ ., data = sim_df)
varImp(tree_model)
##     Overall
## x1 3.346247
## x2 3.868221
## x3 1.671241
## x4 3.554060
## x5 3.209110
library(partykit)
plot(as.party(tree_model), gp = gpar(fontsize = 8))

I created five predictors with increasing granularity and used them to simulate a response to which all five contribute equally. Even so, the tree favored the more granular variables for its early splits: x3 (with 1,000 unique values) was chosen for the root split, and the continuous predictors x4 and x5 were also selected prominently. This happens because predictors with more distinct values offer many more candidate split points, so by chance some of their splits appear better, biasing the tree toward them even though they are no more predictive. The rpart importance scores from this single run are noisy and do not separate the granularity levels sharply, but the split-selection bias shows up clearly in a repeated simulation (see the sketch below).
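
A minimal sketch of such a repeated simulation follows (an addition to the exercise; the response is pure noise so no predictor is truly informative, and the repetition count and seed are arbitrary). The tally of first-split variables should show the high-granularity predictors being chosen far more often than the 10-level one:

# Tally which predictor rpart chooses for the first split when the response
# is unrelated to all predictors; any preference reflects granularity alone.
set.seed(790)
first_split <- replicate(200, {
  d <- data.frame(
    x1 = sample(seq(0.1, 1.0, by = 0.1),   500, replace = TRUE),  # 10 unique values
    x2 = sample(seq(0.01, 1.0, by = 0.01), 500, replace = TRUE),  # 100 unique values
    x4 = runif(500),                                              # effectively continuous
    y  = rnorm(500)                                               # noise response
  )
  fit <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(first_split)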


Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Answer: The model on the right uses a high learning rate (0.9) and a high bagging fraction (0.9): each tree contributes strongly to the fit and is built on nearly the full data set, so the boosting process keeps reusing the same dominant predictors and concentrates the importance on the first few. The model on the left uses 0.1 for both parameters, so each tree makes only a small correction and sees a different small subsample; this slower, more randomized process lets more predictors enter the model and spreads the importance across them.

(b) Which model do you think would be more predictive of other samples?

Answer: The model on the left would likely be more predictive of new samples. Its small learning rate and small bagging fraction act as a form of regularization: by not relying too heavily on just a few predictors, it is less prone to overfitting the training data and more likely to generalize well.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer: Increasing the interaction depth lets each tree split on more variables and capture interactions among them, so predictors beyond the dominant few get used and accumulate importance. This would flatten the slope of the predictor-importance curve for either model, with more predictors sharing the credit instead of just a few standing out.
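
This can be probed on the solubility data used in Fig. 8.24. The sketch below is not the book's exact setup: only shrinkage and bag.fraction are set to the figure's two extremes, while the interaction depth (3) and number of trees (100) are arbitrary choices made here for illustration:

# Fit boosting with the two extreme settings and compare how concentrated
# the relative influence is across predictors.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(831)
gbm_left  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.1, bag.fraction = 0.1)
set.seed(831)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 3,
                 shrinkage = 0.9, bag.fraction = 0.9)

# Top-10 relative influence: expected to be spread out for gbm_left and
# concentrated in a handful of predictors for gbm_right.
head(summary(gbm_left,  plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)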


Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
library(caret)
library(dplyr)

set.seed(101)

data("ChemicalManufacturingProcess")
yield <- ChemicalManufacturingProcess$Yield
predictors <- ChemicalManufacturingProcess[, -1]

prep <- preProcess(predictors, method = c("BoxCox", "knnImpute", "center", "scale"))
predictors_processed <- predict(prep, predictors)
nzv_cols <- nearZeroVar(predictors_processed)
# Guard against the case where no near-zero-variance columns are found
predictors_filtered <- if (length(nzv_cols) > 0) {
  predictors_processed[, -nzv_cols]
} else {
  predictors_processed
}

processed_data <- as.data.frame(predictors_filtered)
processed_data$Yield <- yield
set.seed(102)

train_index <- sample(seq_len(nrow(processed_data)), size = floor(0.85 * nrow(processed_data)))
train <- processed_data[train_index, ]
test <- processed_data[-train_index, ]

set.seed(103)
ctrl <- trainControl(method = "cv", number = 10)

model_cart <- train(
  Yield ~ ., 
  data = train, 
  method = "rpart", 
  trControl = ctrl,
  tuneLength = 10,
  metric = "RMSE"
)

postResample(predict(model_cart, test), test$Yield)
##      RMSE  Rsquared       MAE 
## 1.6602866 0.4621593 1.3816822

Train Bagged Trees:

set.seed(104)

model_bag <- train(
  Yield ~ ., 
  data = train, 
  method = "treebag", 
  trControl = ctrl,
  metric = "RMSE"
)

postResample(predict(model_bag, test), test$Yield)
##      RMSE  Rsquared       MAE 
## 1.4646220 0.6290677 1.1273154

Train the Random Forest model:

set.seed(105)

model_rf <- train(
  Yield ~ ., 
  data = train, 
  method = "rf", 
  trControl = ctrl,
  tuneLength = 5,       # You can increase this to 10+ for deeper tuning
  metric = "RMSE"
)
# Test set performance
rf_pred <- predict(model_rf, test)
postResample(rf_pred, test$Yield)
##     RMSE Rsquared      MAE 
## 1.469259 0.650941 1.087152

Boosted Trees (GBM):

set.seed(106)

model_gbm <- train(
  Yield ~ ., 
  data = train, 
  method = "gbm", 
  trControl = ctrl,
  verbose = FALSE,
  tuneLength = 5, 
  metric = "RMSE"
)
# Test performance
gbm_pred <- predict(model_gbm, test)
postResample(gbm_pred, test$Yield)
##      RMSE  Rsquared       MAE 
## 1.3201434 0.7042172 0.9807163

Cubist:

set.seed(107)

model_cubist <- train(
  Yield ~ ., 
  data = train, 
  method = "cubist", 
  trControl = ctrl,
  tuneLength = 5,
  metric = "RMSE"
)
cubist_pred <- predict(model_cubist, test)
postResample(cubist_pred, test$Yield)
##      RMSE  Rsquared       MAE 
## 1.1574615 0.7863186 0.8785553

Performance summary table:

results <- rbind(
  CART = postResample(predict(model_cart, test), test$Yield),
  Bagged_Tree = postResample(predict(model_bag, test), test$Yield),
  Random_Forest = postResample(predict(model_rf, test), test$Yield),
  GBM = postResample(predict(model_gbm, test), test$Yield),
  Cubist = postResample(predict(model_cubist, test), test$Yield)
)

round(results, 4)
##                 RMSE Rsquared    MAE
## CART          1.6603   0.4622 1.3817
## Bagged_Tree   1.4646   0.6291 1.1273
## Random_Forest 1.4693   0.6509 1.0872
## GBM           1.3201   0.7042 0.9807
## Cubist        1.1575   0.7863 0.8786

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Answer: Based on the comparison of five tree-based models, the Cubist model performed best on the test set, with the lowest RMSE (1.16), highest R² (0.79), and lowest MAE (0.88). Unlike the other models, Cubist combines rule-based splits with linear models in its rules, allowing it to capture both complex patterns and local linear trends. The ensemble methods (bagged trees, random forest, and GBM) also improved substantially over the single CART tree, but Cubist gave the most accurate and reliable predictions overall; its resampling performance can be compared with the sketch below.
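
A minimal sketch for the resampling side of the question (assuming the five train objects above are still available). Note that each model was trained after a different set.seed(), so the cross-validation folds are not identical across models; for a strict comparison the resampling indices would be fixed in trainControl(). Treat this as an approximate check:

# Collect the 10-fold cross-validation results from each caret model and
# summarize RMSE, R-squared, and MAE across resamples.
resamps <- resamples(list(
  CART          = model_cart,
  Bagged_Tree   = model_bag,
  Random_Forest = model_rf,
  GBM           = model_gbm,
  Cubist        = model_cubist
))
summary(resamps)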

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Answer:

plot(varImp(model_cubist), top = 20)

The most important predictors in the Cubist model were primarily process variables, with ManufacturingProcess32, ManufacturingProcess09, and ManufacturingProcess17 ranking highest. Only a few biological variables such as BiologicalMaterial03 and BiologicalMaterial02 appeared in the top 10.

This suggests that manufacturing conditions play a more dominant role than biological inputs in determining yield. Compared with the optimal linear model (Exercise 6.3) and nonlinear model (Exercise 7.5), many of the same process variables, particularly ManufacturingProcess32 and ManufacturingProcess09, were also identified as important, indicating consistency across modeling approaches.

Overall, this reinforces the idea that specific manufacturing steps are key drivers of yield, and optimizing those may lead to better process control and performance.
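
A small sketch (using the fitted model_cubist object above) to tally the split between process and biological variables in the top 10:

# Count how many of the Cubist model's top-10 predictors are
# manufacturing-process variables versus biological variables.
cub_imp <- varImp(model_cubist)$importance
top10   <- rownames(cub_imp)[order(cub_imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))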

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
library(partykit)

# Fit a CART model (again) using the same training data
rpartTree <- rpart(Yield ~ ., data = train)

# Plot 
plot(as.party(rpartTree), 
     ip_args = list(abbreviate = 4), 
     gp = gpar(fontsize = 8))        

Yes, the tree plot gives additional insight into how specific process and biological predictors affect yield. By showing which variables the model splits on first (ManufacturingProcess32 and ManufacturingProcess31, abbreviated in the plot), we can see that certain process steps have a stronger influence on yield. The presence of biological variables such as BiologicalMaterial12 and BiologicalMaterial03 in later splits suggests they still play a role, but perhaps in more specific cases.

The boxplots in the terminal nodes also help us understand how different combinations of these predictors lead to higher or more stable yields. For example, some paths result in narrow, high-yield distributions, which could point to optimal conditions, while others show more variation, suggesting areas for improvement.
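
To complement the boxplots with numbers, here is a small sketch (an addition, using the rpartTree object fit above) that summarizes yield within each terminal node:

# rpartTree$where gives, for each training observation, the row of
# rpartTree$frame corresponding to its terminal node; summarize yield by node.
tapply(train$Yield, rpartTree$where, summary)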

This visual breakdown helps identify which predictors, and which interactions between them, are most important for controlling and improving yield.