Recreate the simulated data from Exercise 7.2
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.4.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
In this section, we fit a random forest model to all predictors and calculate variable importance scores, with the goal of assessing whether the model makes significant use of the uninformative predictors (V6–V10).
# Load necessary libraries
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Fit random forest model
model1 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
# Calculate variable importance
rfImp1 <- varImp(model1, scale = FALSE) %>%
arrange(desc(Overall))
# View the results
print(rfImp1)
## Overall
## V1 8.732235404
## V4 7.615118809
## V2 6.415369387
## V5 2.023524577
## V3 0.763591825
## V6 0.165111172
## V7 -0.005961659
## V10 -0.074944788
## V9 -0.095292651
## V8 -0.166362581
Interpretation:
The three most important variables are V1, V4, and V2; their importance scores are far higher than those of the remaining predictors. The uninformative variables (V6–V10) have very low or even negative importance scores, indicating that the random forest did not rely on them in any meaningful way for prediction.
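For a visual check of the gap between the informative and uninformative predictors, randomForest also provides a built-in importance plot:
# Built-in permutation-importance plot (type = 1 gives %IncMSE)
varImpPlot(model1, type = 1, main = "Permutation Importance (model1)")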
To quantify this:
# Calculate the ratio of top 3 variables' importance to total
s1 <- sum(rfImp1$Overall[1:3])
s2 <- sum(rfImp1$Overall[4:nrow(rfImp1)])
paste("Ratio of top 3 important variables =", round(s1 / (s1 + s2), 3))
## [1] "Ratio of top 3 important variables = 0.897"
The top 3 variables account for about 90% of the total variable importance, confirming that the model relies mainly on the informative predictors.
In this part, we use conditional inference trees from the party package to fit a random forest using cforest(). We calculate variable importance and see if the results are similar to traditional random forests.
# Load party package
library(party)
## Warning: package 'party' was built under R version 4.4.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.4.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.4.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.4.3
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
# Fit cforest model
model3 <- cforest(
  y ~ .,
  data = simulated,
  controls = cforest_unbiased(ntree = 1000)
)
# Compute variable importance
cforest_imp <- varimp(model3)
# Convert to data frame for sorting
cforest_imp_df <- data.frame(
  Variable = names(cforest_imp),
  Importance = cforest_imp
) %>%
  arrange(desc(Importance))
# View the results
print(cforest_imp_df)
## Variable Importance
## V1 V1 8.76868541
## V4 V4 8.38957312
## V2 V2 6.65339933
## V5 V5 1.95883505
## V3 V3 0.04090178
## V7 V7 0.01539978
## V8 V8 -0.01166784
## V6 V6 -0.01881170
## V9 V9 -0.02687455
## V10 V10 -0.06222966
Interpretation:
The top three variables are again V1, V4, and V2, consistent with the traditional random forest results. Using conditional inference trees shifts the distribution of importance values slightly but preserves the overall pattern: informative variables dominate, and uninformative variables contribute little.
We can check the importance concentration:
# Calculate ratio
s1 <- sum(cforest_imp_df$Importance[1:3])
s2 <- sum(cforest_imp_df$Importance[4:nrow(cforest_imp_df)])
paste("Ratio of top 3 important variables =", round(s1 / (s1 + s2), 3))
## [1] "Ratio of top 3 important variables = 0.926"
Again, around 92-93% of the importance is concentrated in the top three variables.
Lastly, we fit boosted trees and Cubist models to the data to see if the pattern of variable importance remains the same.
### Boosted Trees ###
# Load gbm through caret
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Train boosted model
boosted_model <- train(
  y ~ .,
  data = simulated,
  method = "gbm",
  verbose = FALSE
)
# Variable importance
boost_imp <- varImp(boosted_model)$importance %>%
arrange(desc(Overall))
# View results
print(boost_imp)
## Overall
## V4 100.0000000
## V1 93.4377065
## V2 79.3795355
## V5 40.5040021
## V3 26.5017026
## V6 2.3757741
## V7 2.2802334
## V8 0.8515288
## V10 0.6859227
## V9 0.0000000
### Cubist ###
# Train Cubist model
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.4.3
cubist_model <- train(
  y ~ .,
  data = simulated,
  method = "cubist"
)
# Variable importance
cubist_imp <- varImp(cubist_model)$importance %>%
arrange(desc(Overall))
# View results
print(cubist_imp)
## Overall
## V1 100.00000
## V2 75.69444
## V4 68.05556
## V3 58.33333
## V5 55.55556
## V6 15.27778
## V7 0.00000
## V8 0.00000
## V9 0.00000
## V10 0.00000
Interpretation:
Both boosted trees and Cubist identify V1, V2, and V4 as the most important variables. Based on the importance tables above, the top three account for roughly 79% of the total importance for the boosted model and about 65% for Cubist (see the quick computation below), so the concentration is somewhat lower than for the random forests, but the separation between informative and uninformative predictors is unchanged.
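The same top-3 ratio used for the random forests can be computed directly from the importance tables (boost_imp and cubist_imp are already sorted in descending order):
# Share of total importance captured by the top 3 predictors
top3_ratio <- function(imp) sum(imp$Overall[1:3]) / sum(imp$Overall)
paste("GBM top-3 ratio =", round(top3_ratio(boost_imp), 3))
paste("Cubist top-3 ratio =", round(top3_ratio(cubist_imp), 3))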
Thus, across different modeling methods, the overall pattern holds true:
* Informative predictors dominate
* Uninformative predictors contribute minimally
* Correlation among predictors can split variable importance
Use a simulation to show tree bias with different granularities.
Introduction: In this simulation, we aim to investigate how decision trees can show bias toward variables of different granularities—that is, how the number of distinct values a variable takes may influence its perceived importance in tree-based models. To explore this, we will generate synthetic datasets where variables vary in granularity but are otherwise similar in range and scale. By fitting regression trees and random forests to these datasets, we can observe whether higher-granularity variables (those with more distinct levels or continuous variation) tend to be favored during the tree construction process. Through visualization of the trees, variable importance plots, and correlation analyses, we will draw conclusions about the relationship between granularity and model bias.
# Load required libraries
library(dplyr)
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
library(randomForest)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(ggplot2)
library(tidyr)
# Set seed for reproducibility
set.seed(123)
# Simulate dataset with different granularities
n <- 200
df <- data.frame(
  X1 = seq(1, 10, length.out = n),                              # Very fine granularity
  X2 = sample(seq(1, 10, length.out = 50), n, replace = TRUE),  # Fine granularity
  X3 = sample(seq(1, 10, length.out = 25), n, replace = TRUE),  # Moderate granularity
  X4 = sample(seq(1, 10, length.out = 10), n, replace = TRUE),  # Coarse granularity
  X5 = sample(seq(1, 10, length.out = 5), n, replace = TRUE)    # Very coarse granularity
) %>%
  mutate(
    Y = ((X1 + X2 + X3 + X4 + X5) / 5) * rnorm(n, 1, sd = 1.5)
  )
# View first few rows
head(df)
## X1 X2 X3 X4 X5 Y
## 1 1.000000 6.510204 4.375 4 5.50 15.070864
## 2 1.045226 3.571429 10.000 8 7.75 3.897745
## 3 1.090452 3.387755 3.625 10 10.00 8.382889
## 4 1.135678 1.367347 7.375 10 10.00 -3.751770
## 5 1.180905 8.530612 8.500 8 7.75 9.350212
## 6 1.226131 10.000000 4.000 7 1.00 14.386789
# Build a regression tree
tree_model <- rpart(Y ~ ., data = df)
rpart.plot(tree_model, main = "Regression Tree for Simulated Data")
# Variable importance from the regression tree
tree_importance <- as.data.frame(varImp(tree_model))
tree_importance$Variable <- rownames(tree_importance)
tree_importance <- tree_importance %>% arrange(desc(Overall))
print(tree_importance)
## Overall Variable
## X3 0.9175176 X3
## X1 0.7063058 X1
## X2 0.4859202 X2
## X4 0.4472377 X4
## X5 0.3988746 X5
# Plot variable importance (tree)
ggplot(tree_importance, aes(x = reorder(Variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Variable Importance from Regression Tree", x = "Variable", y = "Importance Score") +
  theme_minimal()
# Correlation matrix
corr_matrix <- cor(df)
corrplot(corr_matrix, method = "number", type = "upper")
# Build a random forest
rf_model <- randomForest(Y ~ ., data = df, importance = TRUE, ntree = 1000)
# Variable importance from random forest
rf_importance <- as.data.frame(importance(rf_model)[,1])
colnames(rf_importance) <- c("Overall")
rf_importance$Variable <- rownames(rf_importance)
rf_importance <- rf_importance %>% arrange(desc(Overall))
print(rf_importance)
## Overall Variable
## X3 5.6570873 X3
## X1 1.6449908 X1
## X2 0.3213866 X2
## X4 -1.9804002 X4
## X5 -2.4475861 X5
# Plot variable importance (random forest)
ggplot(rf_importance, aes(x = reorder(Variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  coord_flip() +
  labs(title = "Variable Importance from Random Forest", x = "Variable", y = "Permutation Importance (%IncMSE)") +
  theme_minimal()
# Additional: Visualize Y vs each variable
df_long <- df %>%
  pivot_longer(cols = starts_with("X"), names_to = "Variable", values_to = "Value")
ggplot(df_long, aes(x = Value, y = Y)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~Variable, scales = "free_x") +
  theme_minimal() +
  labs(title = "Relationship Between Each Predictor and Target Y", x = "Predictor Value", y = "Y")
Conclusion/Analysis: In this simulation, the regression tree assigned the highest importance scores to the finer-grained predictors (X3, X1, and X2), while the coarsest predictors (X4 and X5) ranked lowest, even though all five predictors enter the response in exactly the same way. This is consistent with the split-point bias of recursive partitioning: a finer-granularity variable offers more candidate split points, so it is easier for the tree to find a split that reduces error, even when the underlying predictive relationship is identical across variables. The tree plot and the importance bar chart both reflect this ordering.
The random forest showed a broadly similar ordering: X3 and X1 again received the highest permutation importance, while the coarse variables X4 and X5 had near-zero or negative scores. Because random forests average many decorrelated trees and measure importance by permutation (%IncMSE for regression), they are generally less susceptible to granularity-driven bias than a single tree, although in this particular simulation the preference for finer-grained predictors largely persisted. The correlation matrix is a useful cross-check here: since each predictor contributes equally to Y by construction, the differences in importance are better explained by granularity and sampling noise than by genuine differences in predictive strength.
Overall, the results demonstrate that decision trees can be biased towards high-granularity predictors, and that ensemble methods such as random forests mitigate, but do not entirely remove, this tendency. Practitioners should therefore be cautious: when using simple trees, variables with more granular structure may appear disproportionately important, not necessarily because they are better predictors, but because they provide more splitting opportunities. This highlights the importance of validating model results with multiple techniques, performing correlation checks, and not relying solely on raw variable importance measures when interpreting models.
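To see the split-selection bias more directly than a single draw allows, one option is to repeat the simulation many times and record which predictor the root split uses. The sketch below follows the same noise model with three predictors of differing granularity; the settings are illustrative and not part of the analysis above.
# Sketch: tally the variable chosen for the root split of an rpart tree
# over repeated simulations. Under split-point bias, the finer-grained
# predictor should be selected first more often than the coarse ones.
set.seed(456)
first_split <- replicate(200, {
  n <- 200
  sim <- data.frame(
    X_fine = runif(n, 1, 10),                                          # continuous
    X_coarse = sample(seq(1, 10, length.out = 10), n, replace = TRUE), # 10 distinct values
    X_vcoarse = sample(seq(1, 10, length.out = 5), n, replace = TRUE)  # 5 distinct values
  )
  sim$Y <- rowMeans(sim) * rnorm(n, 1, sd = 1.5)
  fit <- rpart(Y ~ ., data = sim)
  if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA_character_
})
table(first_split, useNA = "ifany")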
Stochastic Gradient Boosting: The bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
The model on the right, which uses a high bagging fraction (0.9) and a high learning rate (0.9), places significant importance on only a few predictors because it aggressively fits the strongest signals in the training data. A high bagging fraction means that each tree is trained on almost the entire dataset, which reduces the diversity normally introduced by subsampling and lets the leading features take over the model’s structure quickly. In addition, a high learning rate gives each tree a large influence on the overall model update, so fewer trees are needed and the predictors that show the most predictive power early on receive even more emphasis. Together, these two settings cause the model to commit to the strongest signals early, ignoring weaker predictors that might otherwise have added incremental value. Conversely, the model on the left, with a lower bagging fraction (0.1) and learning rate (0.1), trains more trees on smaller, more varied samples of the data and updates the model more conservatively. As a result, a wider range of variables can contribute to the final model, producing a more spread-out variable importance profile.
In summary, the combination of high bagging fraction and high learning rate causes the model to quickly commit to a small set of predictors and inflate their importance, while smaller values promote gradual learning from a broader array of features. This demonstrates the delicate balance between model flexibility and overfitting in stochastic gradient boosting.
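Although the importance profiles discussed here refer to the textbook's Figure 8.24, they can be reproduced along the following lines. This is a minimal sketch that assumes the solubility data shipped with AppliedPredictiveModeling; the number of trees and interaction depth are illustrative, and bag.fraction is passed through train's ... argument to gbm.
# Sketch: refit boosting on the solubility data with the two extreme settings
library(AppliedPredictiveModeling)
library(caret)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY
grid_low <- expand.grid(n.trees = 100, interaction.depth = 7,
                        shrinkage = 0.1, n.minobsinnode = 10)
grid_high <- expand.grid(n.trees = 100, interaction.depth = 7,
                         shrinkage = 0.9, n.minobsinnode = 10)
set.seed(100)
gbm_low <- train(solTrainXtrans, solTrainY, method = "gbm",
                 tuneGrid = grid_low, bag.fraction = 0.1,
                 trControl = trainControl(method = "none"), verbose = FALSE)
set.seed(100)
gbm_high <- train(solTrainXtrans, solTrainY, method = "gbm",
                  tuneGrid = grid_high, bag.fraction = 0.9,
                  trControl = trainControl(method = "none"), verbose = FALSE)
# Compare how concentrated the variable importance is in each fit
plot(varImp(gbm_low), top = 25, main = "bag.fraction = 0.1, shrinkage = 0.1")
plot(varImp(gbm_high), top = 25, main = "bag.fraction = 0.9, shrinkage = 0.9")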
The model on the left, with a bagging fraction and learning rate of 0.1, is likely to be more predictive on other samples compared to the model on the right. Lower bagging fractions mean that each tree is exposed to only a subset of the training data, which increases the variability between trees and promotes model robustness by reducing the chance of overfitting to any specific idiosyncrasies in the training set. Similarly, a low learning rate allows the model to make gradual, careful corrections, requiring a larger number of trees but producing a model that is less sensitive to noise. In contrast, the model on the right is much more aggressive in its learning: by using nearly the entire dataset at each iteration and making large adjustments, it risks overfitting to the specific training set it sees, thereby losing its ability to generalize to new data. Essentially, the left-hand model adopts a “slow and steady” approach, while the right-hand model risks being “fast but fragile.”
Therefore, while the right-hand model may perform well on the training data, the left-hand model is more likely to maintain predictive accuracy when applied to unseen samples. The gradual learning process and higher diversity between trees create a more stable, generalizable model that better captures the underlying patterns of the data rather than its noise.
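Under the same assumptions as the sketch above, the generalization argument can be checked directly on the solubility test set, which also comes with data(solubility):
# Held-out performance of the two boosted models from the sketch above
postResample(predict(gbm_low, solTestXtrans), solTestY)
postResample(predict(gbm_high, solTestXtrans), solTestY)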
Increasing the interaction depth would allow trees to capture more complex relationships between predictors. For the model on the right, which already focuses heavily on a few dominant predictors due to its aggressive parameter settings, a greater interaction depth might slightly reinforce the existing hierarchy by allowing deeper splits based on the top predictors. However, since the model already heavily favors a few features, the overall slope of predictor importance (how sharply importance drops off after the top variables) would not change dramatically; it would likely remain steep. On the other hand, for the model on the left, which initially distributes importance more evenly across many predictors, increasing interaction depth could allow weaker predictors to contribute more meaningfully by forming more nuanced interactions. This could either flatten the importance curve slightly, if more predictors become valuable together, or make the slope steeper if deeper trees allow stronger predictors to dominate interactions.
Overall, increasing the interaction depth has the potential to amplify the dominance of the strongest predictors in both models, but the effect would be more noticeable in the model with the smaller bagging fraction and learning rate. The model trained conservatively may become slightly more hierarchical with deeper trees, while the aggressive model would remain strongly focused on its already dominant predictors. Careful tuning of interaction depth is essential to ensure that the model complexity matches the underlying structure of the data without tipping into overfitting.
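If one wanted to examine the interaction-depth question empirically, the sketch above could be refit with deeper trees and the importance plot compared again; only that single tuning value changes (the depth of 10 is illustrative):
# Same settings as gbm_low above, but with deeper trees
grid_deep <- expand.grid(n.trees = 100, interaction.depth = 10,
                         shrinkage = 0.1, n.minobsinnode = 10)
set.seed(100)
gbm_deep <- train(solTrainXtrans, solTrainY, method = "gbm",
                  tuneGrid = grid_deep, bag.fraction = 0.1,
                  trControl = trainControl(method = "none"), verbose = FALSE)
plot(varImp(gbm_deep), top = 25,
     main = "bag.fraction = 0.1, shrinkage = 0.1, interaction.depth = 10")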
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
In this section, we will train multiple tree-based regression models — including decision trees, random forests, and gradient boosting machines — on the chemical manufacturing dataset. We will follow the same data pre-processing steps as in Exercises 6.3 and 7.5: imputing missing data with kNN, removing near-zero variance predictors, and splitting into training and testing sets. Our objective is to compare model performance using resampling methods (e.g., cross-validation) and test data to identify which tree-based approach best predicts the yield.
# Load necessary libraries
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.3
library(caret)
library(rpart)
library(randomForest)
library(gbm)
library(rpart.plot)
library(rattle)
## Warning: package 'rattle' was built under R version 4.4.3
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
# Load and pre-process data
data(ChemicalManufacturingProcess)
pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)
nz <- nearZeroVar(df)
df <- df[, -nz]
x <- df[, -1]
y <- df$Yield
# Train/test split
set.seed(123)
index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[index, ]
train_y <- y[index]
test_x <- x[-index, ]
test_y <- y[-index]
# Set up training control
ctrl <- trainControl(method = "cv", number = 10)
# Train a decision tree
set.seed(123)
rpart_model <- train(train_x, train_y, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# Train a random forest
set.seed(123)
rf_model <- train(train_x, train_y, method = "rf", trControl = ctrl)
# Train a gradient boosting machine
set.seed(123)
gbm_model <- train(train_x, train_y, method = "gbm", trControl = ctrl, verbose = FALSE)
# Compare results
resamps <- resamples(list(DecisionTree = rpart_model, RandomForest = rf_model, GBM = gbm_model))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: DecisionTree, RandomForest, GBM
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## DecisionTree 0.4946639 0.5748188 0.6781232 0.6949762 0.7395348 1.0638967 0
## RandomForest 0.3922095 0.4314043 0.5223803 0.5287153 0.5739767 0.8104837 0
## GBM 0.3398416 0.3952855 0.4664283 0.4909520 0.6029151 0.6536436 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## DecisionTree 0.6133304 0.7292200 0.8373642 0.8669032 0.9496134 1.2371847 0
## RandomForest 0.4357804 0.5825731 0.6510795 0.6716572 0.7236801 0.9920208 0
## GBM 0.4537353 0.5131373 0.5517195 0.6102155 0.7173501 0.8906622 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## DecisionTree 0.08484227 0.1814327 0.2646432 0.3113888 0.4736766 0.6440325 0
## RandomForest 0.37862237 0.5182958 0.6525281 0.6088250 0.6914536 0.8431611 0
## GBM 0.51005198 0.6215136 0.6843272 0.6709265 0.7448341 0.7731126 0
# Evaluate on the test set
rpart_preds <- predict(rpart_model, test_x)
rf_preds <- predict(rf_model, test_x)
gbm_preds <- predict(gbm_model, test_x)
postResample(rpart_preds, test_y)
## RMSE Rsquared MAE
## 0.7210451 0.4213143 0.5516201
postResample(rf_preds, test_y)
## RMSE Rsquared MAE
## 0.5127983 0.7209295 0.3972907
postResample(gbm_preds, test_y)
## RMSE Rsquared MAE
## 0.5416519 0.6576865 0.4213455
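For a side-by-side view of the resampled performance, caret's lattice methods for resamples objects can be used:
# Visualize cross-validated performance across the three models
bwplot(resamps, metric = "RMSE")
dotplot(resamps, metric = "Rsquared")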
Conclusion/Analysis: The cross-validation results slightly favor the GBM (lower mean RMSE and higher mean R-squared than the random forest), but on the held-out test set the random forest achieved the lowest RMSE and the highest R-squared of the three models. Given its strong test-set performance and the known robustness of random forests to many correlated predictors, we select the random forest as the optimal tree-based model for the subsequent analysis.
After selecting the random forest model as the best-performing tree-based regression model, we now focus on interpreting it by identifying the most influential predictors. Specifically, we will examine whether biological starting material measurements or process-related variables play a larger role in predicting yield. Additionally, we will compare the top 10 predictors found here with those from previously trained optimal linear and nonlinear models.
# Variable importance from the random forest model
rf_importance <- varImp(rf_model, scale = FALSE)
# Plot variable importance
plot(rf_importance, top = 10, main = "Top 10 Important Predictors for Yield")
# Extract top 10 predictors
top_10_predictors <- rownames(rf_importance$importance)[order(rf_importance$importance$Overall, decreasing = TRUE)][1:10]
top_10_predictors
## [1] "ManufacturingProcess32" "ManufacturingProcess13" "ManufacturingProcess17"
## [4] "ManufacturingProcess09" "BiologicalMaterial03" "ManufacturingProcess36"
## [7] "ManufacturingProcess31" "BiologicalMaterial06" "BiologicalMaterial11"
## [10] "BiologicalMaterial12"
The most important predictors in the random forest model were primarily process-related variables: six of the top 10 predictors were manufacturing process measurements and four were biological material measurements. This indicates that variation in the manufacturing process contributes more to predicting yield than the biological starting materials, although the biological measurements are not negligible. There was also considerable overlap with the top predictors from the previously trained optimal linear and nonlinear models, reinforcing the importance of a few key process-related measurements across the different modeling approaches.
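As a quick tally of process versus biological variables among the top 10 (using the top_10_predictors vector extracted above):
# Count process vs. biological predictors in the top 10
table(ifelse(grepl("^Manufacturing", top_10_predictors), "Process", "Biological"))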
To gain deeper insights into the relationship between predictor variables and yield, we will visualize the structure of a single decision tree model. By plotting the final tree and examining the distribution of yield in its terminal nodes, we aim to uncover meaningful patterns or thresholds in the biological and process variables that affect the manufacturing outcome.
# Recombine training data for modeling
train_data <- data.frame(train_x, Yield = train_y)
# Train a single decision tree for visualization purposes
set.seed(123)
simple_tree <- rpart(Yield ~ ., data = train_data, method = "anova")
# Plot the tree
rpart.plot(simple_tree, type = 3, fallen.leaves = TRUE, extra = 101, main = "Decision Tree for Predicting Yield")
# Fancy tree plot with yield distributions
fancyRpartPlot(simple_tree)
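To complement the plots, the yield distribution within each terminal node can be summarized numerically. This is a small sketch using the simple_tree fit above; rpart's $where component gives the terminal node for each training row, and Yield is on the centered and scaled pre-processed scale.
# Summarize the (pre-processed) yield within each terminal node of the tree
data.frame(node = factor(simple_tree$where), Yield = train_data$Yield) %>%
  group_by(node) %>%
  summarise(n = n(), mean_yield = mean(Yield), sd_yield = sd(Yield))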
The visualization of the decision tree showed that specific manufacturing process variables (such as particular concentration levels) created clear splits, grouping samples into clusters with distinctly higher or lower yields. These splits highlight key thresholds in process measurements that significantly affect the yield, offering actionable insights for optimizing the manufacturing process. Although the full random forest model was more accurate, the single decision tree provided a highly interpretable view of how the important predictors relate to the outcome.