#Chapter 8 Exercises
#8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
##(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
First, we recreate the Friedman dataset. This dataset is a classic benchmark for regression models because it contains non-linear relationships and interactions among the first five variables, while the last five are purely noise.
library(mlbench)
library(randomForest)
library(caret)
library(party)
# Set seed for reproducibility
set.seed(200)
# Generate simulated data
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Fit a Random Forest model with 1000 trees
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Estimate variable importance
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1[order(-rfImp1$Overall), , drop = FALSE])
## Overall
## V1 8.605365900
## V4 7.883384091
## V2 6.831259165
## V5 2.244750293
## V3 0.741534943
## V6 0.136054182
## V7 0.055950944
## V9 0.003196175
## V10 -0.054705900
## V8 -0.068195812
Did the random forest model significantly use the uninformative predictors (V6–V10)?
No, the model did not use them significantly. Based on the variable importance scores, the informative predictors (\(V1\) through \(V5\)) clearly dominate the model. The scores for \(V6\) through \(V10\) are essentially zero (and slightly negative for \(V8\) and \(V10\)), drastically lower than those of the informative variables. This happens because a random forest occasionally selects noise variables for splits by chance, but those splits do not produce a meaningful reduction in the sum of squared errors (SSE), so permuting these variables barely changes the model's accuracy.
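A quick way to make that contrast explicit is to compare the two groups of predictors directly; a minimal sketch using the rfImp1 table from above:
# Sketch: average permutation importance of informative vs. noise predictors
informative <- rfImp1[paste0("V", 1:5), "Overall"]
noise <- rfImp1[paste0("V", 6:10), "Overall"]
round(c(informative = mean(informative), noise = mean(noise)), 3)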
##(b) Now add an additional predictor that is highly correlated with one of the informative predictors.
In this section, we introduce redundancy by adding a variable highly correlated with \(V1\).
# Add an additional predictor highly correlated with V1
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
# Fit a second model
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2[order(-rfImp2$Overall), , drop = FALSE])
## Overall
## V4 6.79689905
## V2 6.19650787
## V1 6.05096011
## duplicate1 4.20641786
## V5 2.26790879
## V3 0.52249125
## V6 0.19172138
## V7 0.03766832
## V9 0.02041124
## V10 -0.04481192
## V8 -0.08406511
Fit another random forest model to these data. Did the importance score for V1 change?
Yes, the importance of \(V1\) decreased significantly, dropping from 8.60 to 6.05. This is a classic symptom of variable redundancy. In the original model, \(V1\) was the primary predictor. By introducing duplicate1 (which is highly correlated with \(V1\)), the model now has two pathways to access the same information.
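As a quick sanity check (a one-line sketch), the correlation between the two predictors can be computed directly:
# Sketch: confirm duplicate1 and V1 are highly correlated
cor(simulated$duplicate1, simulated$V1)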
What happens when you add another predictor that is also highly correlated with V1?
If we were to add a duplicate2, the importance score of \(V1\) would continue to fragment. Instead of a single dominant predictor, we would see three variables (\(V1\), duplicate1, and duplicate2) with mid-to-low importance scores.
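A minimal sketch of that experiment, using a hypothetical duplicate2 built the same way as duplicate1 (and removed again at the end so the later parts of the exercise are unaffected):
# Sketch: add a second predictor correlated with V1 and refit the forest
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3[order(-rfImp3$Overall), , drop = FALSE])
simulated$duplicate2 <- NULL  # drop it so the cforest/GBM fits below match the output shown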
##(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
# Set seed for consistency with the previous results
set.seed(200)
# Fit cforest model
model_cf <- cforest(y ~ ., data = simulated,
control = cforest_unbiased(ntree = 1000))
# Calculate importance - focusing on the V1 vs duplicate1 relationship
imp_conditional <- varimp(model_cf, conditional = TRUE)
# Sorting and displaying results
print(sort(imp_conditional, decreasing = TRUE))
## V4 V2 V1 duplicate1 V5 V7
## 6.22339905 4.82582001 3.33759413 1.39182243 1.25721990 0.01685248
## V6 V3 V9 V10 V8
## 0.01194947 0.01146137 0.00389434 0.00277754 -0.01632860
Do these importances show the same pattern as the traditional random forest model?
No, the pattern is not the same: it is more refined. While the standard random forest in part (b) assigned a high importance to duplicate1 (4.20), the conditional measure, which adjusts each variable's permutation importance for its correlation with the other predictors, reduces it to 1.39.
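For a direct contrast, the traditional (unconditional) permutation importance can be computed on the same cforest fit; a minimal sketch:
# Sketch: traditional (unconditional) permutation importance for the cforest model
imp_unconditional <- varimp(model_cf, conditional = FALSE)
print(sort(imp_unconditional, decreasing = TRUE))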
##(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)
library(Cubist)
# 1. Fit Gradient Boosting Machine (GBM)
set.seed(200)
gbmModel <- gbm(y ~ ., data = simulated,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 7)
# 2. Fit Cubist Model
set.seed(200)
cubistModel <- cubist(x = simulated[, names(simulated) != "y"],
y = simulated$y,
committees = 100)
# 3. Extract and Compare Importance
gbmImp <- summary(gbmModel, plotit = FALSE)
cubistImp <- varImp(cubistModel)
# Display the two importance tables for comparison
print("GBM Variable Importance:")
## [1] "GBM Variable Importance:"
print(gbmImp)
## var rel.inf
## V4 V4 27.364731
## V2 V2 21.444013
## V1 V1 17.540196
## V5 V5 10.718890
## V3 V3 7.407206
## duplicate1 duplicate1 5.807314
## V7 V7 2.482701
## V10 V10 2.208336
## V6 V6 2.174890
## V9 V9 1.653963
## V8 V8 1.197760
print("Cubist Variable Importance:")
## [1] "Cubist Variable Importance:"
print(cubistImp)
## Overall
## V1 71.0
## V3 46.5
## V2 59.0
## V4 48.0
## V5 32.5
## V6 12.0
## V8 1.0
## V7 0.0
## V9 0.0
## V10 0.0
## duplicate1 0.0
The comparison across different tree-based ensembles shows that the ‘importance’ of a variable is highly dependent on the model’s underlying algorithm. While Random Forest suffers from significant importance dilution when faced with correlated predictors, Boosting (GBM) mitigates this by focusing on residuals, and Cubist eliminates the redundancy entirely. This confirms that for datasets with high multicollinearity, standard Random Forest importance may be misleading, and models like Cubist or Conditional Inference Forests provide a more reliable hierarchy of predictors.
# Creating a summary table comparing importance across models (Exercise 8.1)
comparison_results <- data.frame(
Variable = c("V1", "V4", "V2", "duplicate1"),
RF_Importance = c(6.05, 6.79, 6.19, 4.20),
GBM_Importance = c(17.54, 27.36, 21.44, 5.80),
Cubist_Importance = c(71.0, 48.0, 59.0, 0.0)
)
print("Comparison of Importance across Models:")
## [1] "Comparison of Importance across Models:"
print(comparison_results)
## Variable RF_Importance GBM_Importance Cubist_Importance
## 1 V1 6.05 17.54 71
## 2 V4 6.79 27.36 48
## 3 V2 6.19 21.44 59
## 4 duplicate1 4.20 5.80 0
The discrepancy in importance scores arises from the fundamental difference between Bagging (parallel growth with random subsets) and Boosting/Rule-based models (sequential/optimization growth). Bagging is inclusive of redundant features, whereas Boosting and Cubist are naturally more competitive and parsimonious.
#8.3 In stochastic gradient boosting, the bagging fraction and learning rate govern how the trees are constructed. Figure 8.24 in the text shows variable importance plots for two extreme settings: both parameters set to 0.1 (left) and both set to 0.9 (right).
##(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The difference in the distribution of variable importance between the two models is primarily driven by the interaction between the learning rate (shrinkage) and the bagging fraction (stochastic component).
The model on the right (learning rate = 0.9, bagging fraction = 0.9) operates under an “aggressive” and nearly deterministic learning strategy. With a high learning rate of 0.9, the initial trees in the boosting sequence capture a massive amount of information from the strongest predictors almost immediately. Because each tree reduces the error so significantly, there is very little residual variance left for secondary predictors to explain in subsequent iterations. Furthermore, a high bagging fraction of 0.9 means that almost the entire dataset is used to build every tree. This lack of variation ensures that the algorithm consistently identifies and reinforces the same few dominant variables, resulting in a very steep importance curve where only a few predictors have high scores.
The model on the left (learning rate = 0.1, bagging fraction = 0.1) follows a “conservative” and exploratory strategy. By setting the learning rate to 0.1, each tree contributes only a small fraction to the final prediction. This forces the model to take many more steps to reach a solution, preventing the dominant predictors from “monopolizing” the importance scores early on. Additionally, the low bagging fraction of 0.1 introduces a high degree of stochasticity; since each tree only sees a random 10% of the data, different predictors have a much higher chance of being selected as “optimal” in different trees. This randomness, combined with slow learning, effectively distributes the importance across a much larger number of predictors.
In summary, the model on the right acts as a greedy optimizer that quickly converges on the most obvious signals, while the model on the left acts as a robust explorer that uses randomness and incremental steps to account for the contributions of a wider array of variables.
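The effect can be reproduced on a small scale. A minimal sketch, assuming the Friedman simulation data from 8.1 is still in memory (the book's Figure 8.24 uses the solubility data, so the numbers will differ), fits boosting with the two extreme settings and compares how the relative influence is distributed:
# Sketch: boosting with the two extreme settings from Figure 8.24
# n.minobsinnode is lowered because bag.fraction = 0.1 leaves only ~20 rows per tree
set.seed(200)
gbmRight <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7, n.minobsinnode = 5,
                shrinkage = 0.9, bag.fraction = 0.9)
set.seed(200)
gbmLeft <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7, n.minobsinnode = 5,
               shrinkage = 0.1, bag.fraction = 0.1)
# Relative influence: expected to be concentrated for gbmRight, more spread out for gbmLeft
summary(gbmRight, plotit = FALSE)
summary(gbmLeft, plotit = FALSE)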
##(b) Which model do you think would be more predictive of other samples?
The model on the left (learning rate = 0.1 and bagging fraction = 0.1) would generally be more predictive and generalize better to other samples than the model on the right.
The technical reasoning for this choice is based on the Bias-Variance Tradeoff:
Overfitting Risk in the Right Model: The model on the right uses an extremely high learning rate (0.9). In gradient boosting, a high learning rate means the model learns the training data very quickly, often “memorizing” specific patterns and noise rather than general trends. This makes the model highly susceptible to overfitting, meaning it will perform exceptionally well on training data but poorly on new, unseen samples.
Generalization in the Left Model: The model on the left is built on “slow learning.” A lower learning rate (0.1) allows the boosting process to incrementally improve the model, which typically leads to a more stable and accurate solution. Furthermore, the low bagging fraction (0.1) introduces a regularization effect. By training each tree on a small random subset of the data, the model becomes less sensitive to outliers and specific anomalies in the training set.
While the model on the right might show a lower error on the training set, the model on the left is structurally designed to be more robust. The combination of a conservative learning rate and high stochasticity (low bagging) helps the model capture the underlying structural relationships of the data without being distracted by noise, resulting in superior predictive performance on new samples.
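A rough empirical check (again a sketch on the simulated data from 8.1 rather than the book's solubility data) is to ask gbm for cross-validated error under each setting:
# Sketch: 10-fold cross-validated error for each extreme setting
set.seed(200)
gbmRightCV <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 7, n.minobsinnode = 5,
                  shrinkage = 0.9, bag.fraction = 0.9, cv.folds = 10)
set.seed(200)
gbmLeftCV <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 100, interaction.depth = 7, n.minobsinnode = 5,
                 shrinkage = 0.1, bag.fraction = 0.1, cv.folds = 10)
# Minimum cross-validated error over the boosting iterations for each model
c(right = min(gbmRightCV$cv.error), left = min(gbmLeftCV$cv.error))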
##(c) How would increasing interaction depth affect the slope of predictor importance for either model?
Increasing the interaction depth (also known as tree depth or maximum depth) would generally make the slope of the predictor importance plot steeper for both models. The technical explanation for this effect is as follows:
Capturing Complex Interactions: Interaction depth sets how deep each tree can grow (in gbm, the number of splits per tree), which bounds the order of variable interactions a single tree can capture. A shallow tree (low depth) can only capture simple relationships. As the depth increases, the model is allowed to capture complex, high-order interactions between variables.
Reinforcing Strong Predictors: In a boosting framework, if a specific predictor (like \(V1\)) is very strong, a deeper tree allows that variable to appear in multiple splits within the same tree. This enables the variable to “explain” more complex patterns in the data in a single iteration.
Impact on the Slope: When trees are deeper, the algorithm becomes more efficient at identifying the most influential predictors and their interactions. This typically causes the importance scores of the “top-tier” variables to increase significantly relative to the others. Consequently, the gap between the primary predictors and the noise variables widens, resulting in a steeper decline (slope) in the importance profile.
Increasing the interaction depth allows the model to be more “surgical” and specific. By allowing for more complex splits, the model will likely concentrate more importance on the variables that drive those complex interactions, thereby increasing the steepness of the importance slope as the most informative predictors separate themselves further from the less relevant ones.
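A minimal sketch of that effect, again on the simulated data: refit with a shallow and a deep tree and compare how much of the total relative influence the top three predictors absorb.
# Sketch: effect of interaction depth on how concentrated the importance is
set.seed(200)
gbmShallow <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                  n.trees = 500, interaction.depth = 1, shrinkage = 0.1)
set.seed(200)
gbmDeep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 500, interaction.depth = 10, shrinkage = 0.1)
# Share of total relative influence captured by the top three predictors
topShare <- function(fit) {
  ri <- summary(fit, plotit = FALSE)$rel.inf
  sum(sort(ri, decreasing = TRUE)[1:3]) / sum(ri)
}
c(shallow = topShare(gbmShallow), deep = topShare(gbmDeep))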
#8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
##(a) Which tree-based regression model gives the optimal resampling and test set performance?
First, we replicate the pre-processing steps: imputing missing values (using K-Nearest Neighbors), removing near-zero variance predictors, and splitting the data (80/20).
library(AppliedPredictiveModeling)
library(caret)
library(randomForest)
library(gbm)
library(Cubist)
library(rpart)
# Load data
data(ChemicalManufacturingProcess)
# 1. Imputation and Pre-processing
set.seed(200)
preProcessModel <- preProcess(ChemicalManufacturingProcess[, -1],
method = c("knnImpute", "nzv"))
processedX <- predict(preProcessModel, ChemicalManufacturingProcess[, -1])
fullData <- cbind(Yield = ChemicalManufacturingProcess$Yield, processedX)
# 2. Data Splitting (80/20)
set.seed(200)
trainingRows <- createDataPartition(fullData$Yield, p = 0.80, list = FALSE)
trainData <- fullData[trainingRows, ]
testData <- fullData[-trainingRows, ]
# Control for resampling
ctrl <- trainControl(method = "cv", number = 10)
###Training Tree-Based Models
We train four different types of tree models to compare their performance.
# A. Single Tree (CART)
set.seed(200)
rpartFit <- train(Yield ~ ., data = trainData, method = "rpart",
tuneLength = 10, trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# B. Random Forest
set.seed(200)
rfFit <- train(Yield ~ ., data = trainData, method = "rf",
importance = TRUE, tuneLength = 5, trControl = ctrl)
# C. Gradient Boosting Machine (GBM)
set.seed(200)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 100),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
gbmFit <- train(Yield ~ ., data = trainData, method = "gbm",
tuneGrid = gbmGrid, verbose = FALSE, trControl = ctrl)
# D. Cubist
set.seed(200)
cubistFit <- train(Yield ~ ., data = trainData, method = "cubist", trControl = ctrl)
###Performance Comparison
We evaluate the models based on RMSE and R-squared from the resampling results and the test set.
# Compare Resampling Results
results <- resamples(list(CART = rpartFit, RF = rfFit, GBM = gbmFit, Cubist = cubistFit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, RF, GBM, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.7521420 0.9143625 1.0287887 1.0750675 1.2554758 1.5085098 0
## RF 0.6835193 0.7621885 0.8056448 0.8689476 0.9110501 1.3139723 0
## GBM 0.6427642 0.7489262 0.8307783 0.8910700 0.9330901 1.3857342 0
## Cubist 0.4976902 0.5941788 0.7606438 0.7427662 0.8929736 0.9831922 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 1.0272083 1.1363924 1.2463971 1.300108 1.459554 1.737856 0
## RF 0.8304504 0.9329539 1.0406887 1.115289 1.141090 1.784332 0
## GBM 0.8374423 0.8998044 1.0276209 1.107390 1.178867 1.764755 0
## Cubist 0.6491384 0.7253296 0.9245785 0.912456 1.092858 1.169845 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.1289495 0.4493491 0.5638822 0.5034400 0.5994056 0.6942489 0
## RF 0.2607255 0.5788184 0.6842505 0.6351626 0.7153527 0.8561991 0
## GBM 0.3431925 0.5561943 0.6593873 0.6289838 0.7477367 0.7906876 0
## Cubist 0.5450935 0.6919676 0.7683184 0.7441098 0.8079447 0.8731670 0
# Test Set Performance
modelList <- list(CART = rpartFit, RF = rfFit, GBM = gbmFit, Cubist = cubistFit)
testResults <- lapply(modelList, function(x) postResample(predict(x, testData), testData$Yield))
print(testResults)
## $CART
## RMSE Rsquared MAE
## 1.7600424 0.2550851 1.2709342
##
## $RF
## RMSE Rsquared MAE
## 1.3283774 0.5984611 0.9430358
##
## $GBM
## RMSE Rsquared MAE
## 1.2191298 0.6619332 0.8622568
##
## $Cubist
## RMSE Rsquared MAE
## 1.1717857 0.7038006 0.9463648
To determine the optimal tree-based model, we evaluated four different algorithms: a single CART tree, Random Forest, Gradient Boosting Machines (GBM), and Cubist. Based on the 10-fold cross-validation and the test set evaluation, the results are as follows:
Optimal Model (Cubist): The Cubist model provided the best performance, achieving the lowest Root Mean Squared Error (RMSE) and the highest R-squared value among all candidates. Cubist’s superiority in this dataset is due to its unique architecture, which combines a tree-based structure with linear regression models at the terminal nodes (leaves). This allows the model to capture local linear trends within the manufacturing process that standard trees might oversimplify.
Resampling Performance: During the resampling process (Cross-Validation), the Cubist model showed high stability. While Random Forest and GBM also performed well, they generally exhibited slightly higher RMSE. Random Forest, being an averaging ensemble, reduces variance effectively, but Cubist’s ability to use the “committee” approach and instance-based corrections allowed it to refine the predictions for the chemical yield more accurately.
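The tuning values behind that committee and neighbor behavior are easy to inspect from the caret object; a minimal sketch:
# Sketch: committees/neighbors selected by cross-validation for the Cubist model
cubistFit$bestTune
head(cubistFit$results[order(cubistFit$results$RMSE), ], 3)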
Test Set Performance: When evaluated on the independent test set, the Cubist model maintained its lead, proving that it generalizes well to unseen data. It successfully captured the complex interactions between the biological raw materials and the manufacturing process variables without overfitting the training noise.
In conclusion, Cubist is the optimal tree-based regression model for the Chemical Manufacturing process data. It outperforms traditional bagging and boosting methods by leveraging the strengths of both tree-based partitions and linear regression, providing a more precise estimation of the process yield.
# Extract and plot Variable Importance for the optimal Cubist model
cubist_importance <- varImp(cubistFit, scale = FALSE)
# Plotting the top 10 predictors
plot(cubist_importance, top = 10,
main = "Top 10 Important Predictors (Cubist Model)",
xlab = "Importance Score")
# Identify the names of the top 10 predictors
top_10_names <- rownames(cubist_importance$importance)[order(cubist_importance$importance$Overall, decreasing = TRUE)][1:10]
print(top_10_names)
## [1] "ManufacturingProcess17" "ManufacturingProcess32" "ManufacturingProcess13"
## [4] "ManufacturingProcess09" "ManufacturingProcess33" "ManufacturingProcess30"
## [7] "BiologicalMaterial09" "ManufacturingProcess29" "BiologicalMaterial06"
## [10] "ManufacturingProcess31"
According to the variable importance results generated by the Cubist model (the optimal performer), the hierarchy of predictors is clearly defined. Cubist assesses importance by counting how often each variable appears in the rule conditions (splits) and in the linear models fitted within the terminal rules; a sketch below shows how to pull that usage table directly. The top five predictors, in order of importance, are:
ManufacturingProcess17 (the most dominant predictor)
ManufacturingProcess32
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess33
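That rule-condition versus linear-model usage can be read off the fitted Cubist object; a minimal sketch (assuming the finalModel slot of the caret fit exposes the Cubist package's usage table):
# Sketch: percentage of rule conditions and terminal linear models using each variable
usageTab <- cubistFit$finalModel$usage
head(usageTab[order(-(usageTab$Conditions + usageTab$Model)), ], 10)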
Dominance Analysis (Process vs. Biological): The analysis shows that Manufacturing Process variables heavily dominate the list. Specifically, variables like ManufacturingProcess17, 32, and 13 hold the highest importance scores. While some biological components like BiologicalMaterial09 and BiologicalMaterial06 are present in the top 10, their relative contribution is secondary compared to the process-specific variables. This suggests that the final yield is more sensitive to mechanical and environmental adjustments during production than to the initial variability of the raw biological materials.
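A quick tally of the top 10 names (a sketch reusing top_10_names from above) makes the process-versus-biological split explicit:
# Sketch: count process vs. biological predictors among the top 10
table(ifelse(grepl("^Manufacturing", top_10_names), "Process", "Biological"))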
Comparison with Previous Models:
Comparison with Optimal Linear Model: While there is overlap in variables like ManufacturingProcess32, the tree-based model places a much higher emphasis on ManufacturingProcess17. Linear models (like PLS) may have overlooked the full impact of this variable if its relationship with the yield is non-linear or conditional upon other factors.
Comparison with Optimal Nonlinear Model (e.g., SVM): The results are more aligned with non-linear models. However, Cubist provides a more granular view by identifying which specific process steps are critical for the rule-based structure of the model. The top 10 predictors here represent a more specialized subset that captures local interactions better than the global kernels of an SVM.
Conclusion: The transition to tree-based models confirms that the Manufacturing Process variables are the primary drivers of yield. By identifying ManufacturingProcess17 as the top predictor, the Cubist model reveals that specific production stages are the most critical points for optimizing the chemical process.
# Load necessary libraries for visualization
library(rpart.plot)
library(partykit)
# Plotting the single CART tree (from part a)
# We use rpart.plot for a clean visualization with terminal node information
rpart.plot(rpartFit$finalModel,
type = 4,
extra = 101, # Shows the number of observations and mean yield
under = TRUE,
cex = 0.8,
main = "Optimal Single Tree for Chemical Manufacturing Yield")
# Alternative professional view using partykit
plot(as.party(rpartFit$finalModel),
main = "Tree Structure with Terminal Node Distributions")
The visualization of the optimal single CART tree provides a strategic and structural perspective that complements the importance scores from the previous sections.
Identification of Critical Thresholds: While the importance plot tells us that ManufacturingProcess17 is the most significant variable, the single tree reveals the exact decision nodes and cut-off values. For example, the tree shows the specific pressure, temperature, or flow rate at which the process splits between a high-yield and a low-yield outcome. This provides actionable intelligence for process engineers that a simple importance ranking cannot.
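Those cut-off values can also be printed as plain-text rules; a minimal sketch using rpart.plot's rule printer on the CART model from part (a):
# Sketch: print the splitting rules and thresholds of the single CART tree
rpart.rules(rpartFit$finalModel, roundint = FALSE)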
Sequential Dependencies: The tree structure illustrates that the manufacturing process is a sequence of conditional events. We can observe that Manufacturing Process variables typically appear at the top of the tree (the root and first branches), acting as primary filters. Biological Materials often appear in lower branches, suggesting their impact is secondary and only becomes relevant once the manufacturing conditions defined by variables like Process17 or Process13 are met.
Process Transparency vs. Accuracy: Although the Cubist model is more accurate numerically, this single tree offers “process transparency.” It translates the complex relationship between 50+ predictors into a logical flowchart. By observing the distribution of yield in the terminal nodes, we can identify which specific combinations of process settings lead to the most stable and highest yield results.
Conclusion: Yes, this view provides significant additional knowledge. It reveals that the chemical manufacturing process is not just a collection of independent inputs, but a hierarchical system where early process stages (specifically ManufacturingProcess17) are the gatekeepers of the final yield. This visualization simplifies the complex model into a clear set of operational rules.