#8.1. Recreate the simulated data from Exercise 7.2:
# Load libraries
library(mlbench)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
# Recreate the data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
head(simulated)
## V1 V2 V3 V4 V5 V6 V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## V8 V9 V10 y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817
#(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
#Random Forest Model
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
## Overall
## V1 8.605365900
## V2 6.831259165
## V3 0.741534943
## V4 7.883384091
## V5 2.244750293
## V6 0.136054182
## V7 0.055950944
## V8 -0.068195812
## V9 0.003196175
## V10 -0.054705900
#Comment: No, the random forest did not significantly use V6–V10. Their importance scores are near zero (from about -0.07 to 0.14), well below those of the informative predictors V1–V5, so the model picked up the true signal and largely ignored the noise.
#(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9476651
#Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
# Fit new Random Forest Model
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 5.26426006
## V2 6.07175433
## V3 0.51040749
## V4 6.80567764
## V5 2.18127427
## V6 0.29434454
## V7 -0.01490498
## V8 -0.09461681
## V9 0.01769244
## V10 -0.07631531
## duplicate1 4.28678654
#Comment: Yes. The importance of V1 drops from 8.61 to 5.26 after adding duplicate1, which itself receives 4.29. Because duplicate1 is highly correlated with V1 (r = 0.95), trees can split on either one, so the importance that V1 held alone is now shared between the two.
# Add another correlated predictor
set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
# Fit again
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
## Overall
## V1 4.67634284
## V2 6.31555205
## V3 0.49765769
## V4 6.97156846
## V5 2.28059725
## V6 0.26673432
## V7 -0.02119099
## V8 -0.05351292
## V9 -0.07305751
## V10 0.05244198
## duplicate1 2.92756105
## duplicate2 2.62811786
#Comment: Yes, the importance of V1 decreases further after adding duplicate2, from 5.26 to 4.68. The random forest distributes importance among all highly correlated variables (V1, duplicate1, duplicate2), so each individual variable appears less important even though together they carry the same information.
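#Note: One way to check that the correlated predictors share importance, instead of losing it, is to add up the scores of the V1 group in each fit. The sketch below is only arithmetic on the values printed in the three varImp() tables above; nothing is refit, and the object names (v1_group, group_total, v1_share) are illustrative.

```r
# Importance scores copied from the printed varImp() tables for model1-model3
v1_group <- list(
  model1 = c(V1 = 8.605365900),
  model2 = c(V1 = 5.26426006, duplicate1 = 4.28678654),
  model3 = c(V1 = 4.67634284, duplicate1 = 2.92756105, duplicate2 = 2.62811786)
)

# Total importance carried by V1 and its correlated copies in each model
group_total <- sapply(v1_group, sum)

# Fraction of that total still credited to V1 itself
v1_share <- sapply(v1_group, function(s) s[["V1"]] / sum(s))

round(rbind(group_total, v1_share), 3)
```

#Note: The group total stays in the same range across the three fits (about 8.6, 9.6, and 10.2) while the share credited to V1 itself falls, which is the dilution described in the comments above.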
#(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Load package
# install.packages("party")  # uncomment and run once if the package is not installed
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
# Fit conditional inference random forest
set.seed(205)
cforest_model <- cforest(
formula = y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10,
data = simulated,
controls = cforest_control(ntree = 1000, mtry = 3)
)
# View model
cforest_model
##
## Random Forest using Conditional Inference Trees
##
## Number of trees: 1000
##
## Response: y
## Inputs: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10
## Number of observations: 200
# Traditional importance
varimp(cforest_model, conditional = FALSE)
## V1 V2 V3 V4 V5 V6
## 6.727925636 5.049566124 0.074273724 6.702425113 1.731125769 -0.001331295
## V7 V8 V9 V10
## 0.086768108 -0.043730153 -0.044603667 -0.016271735
# Conditional importance (corrected for correlation)
varimp(cforest_model, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 3.01288795 3.24711183 0.05342356 4.17189774 0.77684080 0.02352628
## V7 V8 V9 V10
## 0.04478180 -0.02570109 0.01613121 -0.01403483
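#Comment: Broadly yes, with one caveat. The forest above was fit on V1–V10 only, so duplicate1 and duplicate2 are not part of this comparison. Both measures give the same overall picture as the first random forest: V1, V2, and V4 carry most of the importance, V5 is moderate, V3 is close to zero, and V6–V10 are negligible. The conditional measure shrinks every score (for example, V1 falls from 6.73 to 3.01) and changes the order of the top three from V1, V4, V2 to V4, V2, V1. To test how conditional importance handles correlated predictors, the duplicates would need to be included in the formula.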
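#Note: The fit above leaves out duplicate1 and duplicate2. A minimal sketch of the comparison with the correlated copies included might look like the following. Assumptions: the data are regenerated in base R from the Friedman 1 formula (not via mlbench), so the numbers will not match the output above; ntree = 100 and the object names (sim, cf, imp) are illustrative choices; the forest is only fit if the party package is installed.

```r
# Simulate Friedman 1 data in base R (assumed stand-in for mlbench.friedman1)
set.seed(200)
n <- 200
sim <- as.data.frame(matrix(runif(n * 10), ncol = 10,
                            dimnames = list(NULL, paste0("V", 1:10))))
sim$y <- 10 * sin(pi * sim$V1 * sim$V2) + 20 * (sim$V3 - 0.5)^2 +
  10 * sim$V4 + 5 * sim$V5 + rnorm(n, sd = 1)

# Two noisy copies of V1, built the same way as in part (b)
sim$duplicate1 <- sim$V1 + rnorm(n) * 0.1
sim$duplicate2 <- sim$V1 + rnorm(n) * 0.1

if (requireNamespace("party", quietly = TRUE)) {
  suppressPackageStartupMessages(library(party))
  # y ~ . now includes duplicate1 and duplicate2
  cf <- cforest(y ~ ., data = sim,
                controls = cforest_unbiased(ntree = 100, mtry = 3))
  # Side-by-side traditional and conditional importance (the latter is slow)
  imp <- cbind(
    traditional = varimp(cf, conditional = FALSE),
    conditional = varimp(cf, conditional = TRUE)
  )
  print(round(imp, 3))
}
```

#Note: Comparing the V1, duplicate1, and duplicate2 rows across the two columns shows directly whether the conditional measure changes how importance is split among correlated predictors.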
#(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Load libraries
# install.packages("Cubist")  # uncomment and run once if the package is not installed
library(gbm)
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(Cubist)
#1. GBM Model
set.seed(200)
gbm_model <- train(
y ~ .,
data = simulated,
method = "gbm",
verbose = FALSE
)
# Variable importance (GBM)
gbm_varimp <- varImp(gbm_model, scale = FALSE)
gbm_varimp
## gbm variable importance
##
## Overall
## V4 4796.52
## V2 3154.03
## V1 2437.41
## V5 1968.54
## V3 1425.90
## duplicate2 1244.27
## duplicate1 981.39
## V6 173.16
## V7 139.79
## V8 89.90
## V9 71.99
## V10 69.05
#Comment: Yes, GBM shows a similar pattern. V1–V5 are the five most important predictors and the noise variables V6–V10 rank last. The duplicates again take a noticeable share of importance (duplicate2 1244, duplicate1 981), and V1 ranks third behind V4 and V2, which is consistent with V1's importance being diluted across its correlated copies.
#2. Cubist Model
set.seed(200)
cubist_model <- train(
y ~ .,
data = simulated,
method = "cubist"
)
# Variable importance (Cubist)
cubist_varimp <- varImp(cubist_model, scale = FALSE)
cubist_varimp
## cubist variable importance
##
## Overall
## V1 69.5
## V2 56.5
## V4 49.0
## V3 42.5
## V5 37.0
## duplicate1 4.5
## V6 0.0
## V8 0.0
## V10 0.0
## duplicate2 0.0
## V9 0.0
## V7 0.0
#Comment: Partly. The true predictors (V1–V5) dominate and the noise variables (V6–V10) all receive zero importance, as before. However, Cubist appears much less affected by the correlated predictors than random forests or GBM: duplicate1 scores only 4.5, duplicate2 scores 0, and V1 remains the top-ranked predictor at 69.5.
#8.2. Use a simulation to show tree bias with different granularities.
# Load necessary libraries
library(rpart)
library(rpart.plot)
set.seed(123)
n <- 1000
# 1. Create the data
# Y is a random binary outcome
y <- factor(sample(c("A", "B"), n, replace = TRUE))
# X_high: Continuous noise (high granularity)
x_high <- runif(n)
# X_low: Binary noise (low granularity)
x_low <- sample(c(0, 1), n, replace = TRUE)
df <- data.frame(y, x_high, x_low)
# 2. Run a simulation to see how often each is chosen as the first split
results <- replicate(500, {
# Re-randomize Y every time so there is NO true relationship
df$y <- factor(sample(c("A", "B"), n, replace = TRUE))
# Fit a shallow tree
tree <- rpart(y ~ x_high + x_low, data = df,
control = rpart.control(maxdepth = 1, cp = 0))
# Return which variable was used for the first split
as.character(tree$frame$var[1])
})
# 3. View the bias
table(results)
## results
## <leaf> x_high x_low
## 113 366 21
#Comment: This simulation demonstrates the bias of decision trees toward predictors with higher granularity.
# The response is random and independent of both predictors:
# - x_high: continuous variable (many candidate split points)
# - x_low: binary variable (one candidate split point)
# Because the response is re-randomized in every run, neither predictor has a true relationship with the outcome.
# Across 500 depth-1 trees, x_high was chosen for the first split 366 times, x_low 21 times, and 113 trees made no split at all.
# A continuous variable offers many possible split points, which increases the chance that some split reduces impurity purely by chance. Decision trees are therefore biased toward predictors with more potential split points, even when no true relationship exists.
# Count results (remove NA or <leaf> if present)
counts <- table(results)
counts <- counts[names(counts) != "<leaf>"] # optional cleanup
# Bar plot
barplot(counts,
main = "Variable Selection Frequency (Tree Bias Simulation)",
xlab = "Predictor",
ylab = "Number of Times Selected",
col = "lightblue",
border = "black")
#Comment: The bar plot shows that the continuous variable x_high is selected far more frequently than the binary variable x_low, despite neither having a true relationship with the response. This visual evidence reinforces the conclusion that decision trees are biased toward predictors with more possible split points.
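#Note: The mechanism behind this bias can be sketched without rpart. The code below computes, for pure-noise data, the best Gini impurity reduction over every candidate split of a continuous predictor and compares it with the single split available to a binary predictor. This is a simplified illustration, not rpart's exact algorithm; the helper names (gini, best_gain) and the sizes (n = 200, 200 repetitions) are assumptions made for the example.

```r
set.seed(1)

# Gini impurity of a 0/1 outcome vector
gini <- function(y) {
  p <- mean(y)
  2 * p * (1 - p)
}

# Largest impurity reduction over all splits of the form x <= cut
best_gain <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]  # splitting at the maximum leaves one side empty
  parent <- gini(y)
  gains <- vapply(cuts, function(cc) {
    left <- x <= cc
    w <- mean(left)
    parent - (w * gini(y[left]) + (1 - w) * gini(y[!left]))
  }, numeric(1))
  max(gains)
}

n <- 200
wins <- replicate(200, {
  y      <- rbinom(n, 1, 0.5)  # outcome unrelated to either predictor
  x_high <- runif(n)           # many candidate split points
  x_low  <- rbinom(n, 1, 0.5)  # exactly one candidate split point
  best_gain(x_high, y) > best_gain(x_low, y)
})

# Proportion of runs in which the continuous noise variable looks better
mean(wins)
```

#Note: Taking the maximum over many candidate splits is what inflates the apparent usefulness of the continuous predictor, even though both predictors are noise.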
#8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance.
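#(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
#(a) Answer: The right-hand model uses a high learning rate and a high bagging fraction (both 0.9 in Fig. 8.24, versus 0.1 on the left). Each tree therefore takes a large, greedy step and is built on almost the same data, so the strongest few predictors are chosen again and again and dominate the importance. The left-hand model learns slowly on small random subsamples, so more trees are needed and a wider range of predictors contributes.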
#(b) Which model do you think would be more predictive of other samples?
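#(b) Answer: The left model would typically generalize better, because slow learning on varied subsamples reduces over-reliance on a few predictors and lowers the risk of overfitting. As the exercise notes, the optimal values should still be confirmed by tuning.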
#(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
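#(c) Answer: Increasing interaction depth would likely flatten the slope of the importance profile for either model: deeper trees split on more predictors and capture interactions, so importance is shared more evenly instead of concentrating on the top few.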
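#Note: The effect discussed in 8.3 can be explored empirically. The following is a rough sketch, assuming the gbm package is installed. It uses data simulated from the Friedman 1 formula (not the data behind Fig. 8.24), sets the learning rate and bagging fraction to the same assumed value (0.1 or 0.9), and uses an arbitrary n.trees = 100. The helper name fit_and_rank is hypothetical.

```r
# Simulated regression data, for illustration only
set.seed(200)
n <- 500
sim <- as.data.frame(matrix(runif(n * 10), ncol = 10,
                            dimnames = list(NULL, paste0("V", 1:10))))
sim$y <- 10 * sin(pi * sim$V1 * sim$V2) + 20 * (sim$V3 - 0.5)^2 +
  10 * sim$V4 + 5 * sim$V5 + rnorm(n, sd = 1)

# Fit a boosted model with shrinkage = bag.fraction = rate and return the
# relative influence of each predictor as a percentage, sorted high to low
fit_and_rank <- function(rate, depth = 1) {
  fit <- gbm::gbm(y ~ ., data = sim, distribution = "gaussian",
                  n.trees = 100, interaction.depth = depth,
                  shrinkage = rate, bag.fraction = rate)
  inf <- gbm::relative.influence(fit, n.trees = 100)
  sort(100 * inf / sum(inf), decreasing = TRUE)
}

if (requireNamespace("gbm", quietly = TRUE)) {
  slow <- fit_and_rank(0.1)  # low learning rate, small bagging fraction
  fast <- fit_and_rank(0.9)  # high learning rate, large bagging fraction
  # Importance by rank position: a steeper drop means more concentration
  print(round(cbind(rank = seq_along(slow),
                    slow = unname(slow),
                    fast = unname(fast)), 1))
}
```

#Note: Passing a larger depth to fit_and_rank() gives a way to check the answer to (c) on the same simulated data. Results on other data sets may differ.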
#8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(caret)
library(AppliedPredictiveModeling)
library(rpart)
library(rpart.plot)
library(gbm)
library(randomForest)
library(ipred) # for bagging
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:party':
##
## where
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(123)
#Data Loading
data(ChemicalManufacturingProcess)
# Check structure
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
#Impute missing values & Preprocessing
preProc <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")
chem_data <- predict(preProc, ChemicalManufacturingProcess)
#Train/Test split
set.seed(123)
trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)
train_data <- chem_data[trainIndex, ]
test_data <- chem_data[-trainIndex, ]
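#Note: In the steps above the medians are estimated on all 176 rows (and on Yield as well as the predictors) before the split, so the test rows influence the imputation. One possible alternative is to estimate the medians on the training predictors only and apply them to both sets. The sketch below shows the idea in base R on a small made-up data frame; the column names P1 and P2 and the helper impute() are hypothetical. With caret, the analogous step would be to call preProcess() on the training predictors and predict() on each set.

```r
# Toy data with missing predictor values (illustrative, not the chemical data)
set.seed(123)
toy <- data.frame(Yield = rnorm(50, mean = 40, sd = 2),
                  P1 = rnorm(50),
                  P2 = runif(50))
toy$P1[sample(50, 5)] <- NA
toy$P2[sample(50, 5)] <- NA

# Split first, then learn the imputation values from the training rows only
in_train <- sample(50, 40)
train <- toy[in_train, ]
test  <- toy[-in_train, ]

pred_cols <- setdiff(names(toy), "Yield")
train_medians <- vapply(train[pred_cols], median, numeric(1), na.rm = TRUE)

# Fill missing predictor values with the training medians
impute <- function(df) {
  for (col in pred_cols) {
    df[[col]][is.na(df[[col]])] <- train_medians[[col]]
  }
  df
}

train_imp <- impute(train)
test_imp  <- impute(test)
```

#Note: Applying this to the chemical data would change the imputed values slightly, so the resampling and test results reported below would not be reproduced exactly.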
#Train multiple tree-based models
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
#Single Tree
set.seed(123)
tree_model <- train(
Yield ~ ., data = train_data,
method = "rpart",
trControl = ctrl,
tuneLength = 10
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#Bagging
set.seed(123)
bag_model <- train(
Yield ~ ., data = train_data,
method = "treebag",
trControl = ctrl
)
#Random Forest
set.seed(123)
rf_model <- train(
Yield ~ ., data = train_data,
method = "rf",
trControl = ctrl,
tuneLength = 5
)
#GBM
set.seed(123)
gbm_model <- train(
Yield ~ ., data = train_data,
method = "gbm",
trControl = ctrl,
tuneLength = 5,
verbose = FALSE
)
#Compare Models
resamps <- resamples(list(
Tree = tree_model,
Bagging = bag_model,
RF = rf_model,
GBM = gbm_model
))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: Tree, Bagging, RF, GBM
## Number of resamples: 15
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.7745962 0.9091955 1.0360495 1.0378669 1.1217882 1.367055 0
## Bagging 0.6193870 0.8392414 0.8631295 0.9468185 1.0510798 1.361668 0
## RF 0.5974634 0.7882993 0.9159233 0.8968593 0.9640028 1.236529 0
## GBM 0.6546623 0.8814852 0.9289042 0.9371222 1.0147650 1.238478 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 1.1244019 1.265769 1.347449 1.373317 1.517252 1.813484 0
## Bagging 0.7783261 1.110260 1.233130 1.255362 1.338752 1.807642 0
## RF 0.7051958 1.031407 1.145138 1.168946 1.250480 1.641251 0
## GBM 0.7734493 1.121092 1.195685 1.206292 1.297348 1.593903 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Tree 0.3459136 0.3913288 0.4932775 0.4853639 0.5588593 0.7193093 0
## Bagging 0.3298106 0.5001703 0.5669150 0.5491661 0.5965383 0.7787947 0
## RF 0.4454872 0.5844404 0.6261969 0.6241108 0.6608936 0.8092395 0
## GBM 0.3668440 0.5589072 0.5962755 0.5993646 0.6606197 0.7868107 0
dotplot(resamps)
#Test set performance
models <- list(tree_model, bag_model, rf_model, gbm_model)
names(models) <- c("Tree", "Bagging", "RF", "GBM")
test_results <- lapply(models, function(mod) {
preds <- predict(mod, test_data)
postResample(preds, test_data$Yield)
})
test_results
## $Tree
## RMSE Rsquared MAE
## 1.8552995 0.1699285 1.3315952
##
## $Bagging
## RMSE Rsquared MAE
## 1.397220 0.419397 1.034465
##
## $RF
## RMSE Rsquared MAE
## 1.2760958 0.5320198 0.9841074
##
## $GBM
## RMSE Rsquared MAE
## 1.2520131 0.5467897 0.9738349
#(a) Which tree-based regression model gives the optimal resampling and test set performance?
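#(a) Answer: The two criteria point to different models. In resampling, random forest was best (mean RMSE 1.169, mean R² 0.624), with GBM close behind (1.206, 0.599). On the test set, GBM was marginally best (RMSE 1.252, R² 0.547) versus random forest (1.276, 0.532). The gap between the two is small in both cases, while bagging and especially the single tree performed clearly worse. GBM is used below as the optimal model based on its test set result.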
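#Note: To make the comparison between resampling and test set performance explicit, the two sets of RMSE values can be put side by side. The sketch below only tabulates numbers copied from the summary(resamps) and test_results output above; the data frame name perf is illustrative.

```r
# Mean resampled RMSE and test set RMSE, copied from the printed output above
perf <- data.frame(
  model     = c("Tree", "Bagging", "RF", "GBM"),
  cv_rmse   = c(1.373317, 1.255362, 1.168946, 1.206292),
  test_rmse = c(1.8552995, 1.397220, 1.2760958, 1.2520131)
)

# Rank of each model under both criteria (1 = lowest RMSE)
perf$cv_rank   <- rank(perf$cv_rmse)
perf$test_rank <- rank(perf$test_rmse)
perf

# Best model under each criterion
best_cv   <- perf$model[which.min(perf$cv_rmse)]
best_test <- perf$model[which.min(perf$test_rmse)]
c(resampling = best_cv, test = best_test)
```

#Note: With only about 20% of 176 rows in the test set, a difference of roughly 0.02 in test RMSE between RF and GBM is probably too small to separate the two models reliably.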
#(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(gbm_model)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31 24.019
## BiologicalMaterial06 19.152
## ManufacturingProcess13 17.969
## ManufacturingProcess17 17.462
## BiologicalMaterial03 15.418
## BiologicalMaterial12 14.296
## ManufacturingProcess06 12.745
## ManufacturingProcess09 12.138
## BiologicalMaterial09 8.787
## ManufacturingProcess25 6.257
## ManufacturingProcess01 6.208
## ManufacturingProcess37 5.760
## ManufacturingProcess27 4.957
## BiologicalMaterial10 3.847
## ManufacturingProcess21 3.667
## ManufacturingProcess39 3.622
## ManufacturingProcess36 2.644
## ManufacturingProcess03 2.438
## ManufacturingProcess44 2.404
plot(varImp(gbm_model), top = 20)
#(b) Answer: ManufacturingProcess32 is by far the most important predictor (scaled importance 100, versus 24 for the runner-up ManufacturingProcess31). Other key variables include BiologicalMaterial06, ManufacturingProcess13, and ManufacturingProcess17. Process variables dominate: they account for 6 of the top 10 and 15 of the top 20 predictors. Compared to the earlier linear and nonlinear models, GBM shows some overlap but also highlights additional predictors, likely because it captures nonlinearities and interactions.
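#Note: The split between process and biological predictors can be counted directly from the importance table. The sketch below uses the top 10 names copied from the varImp(gbm_model) output above; the object names (top10, family, family_counts) are illustrative.

```r
# Top 10 predictors, in order, from the printed GBM importance table
top10 <- c("ManufacturingProcess32", "ManufacturingProcess31",
           "BiologicalMaterial06", "ManufacturingProcess13",
           "ManufacturingProcess17", "BiologicalMaterial03",
           "BiologicalMaterial12", "ManufacturingProcess06",
           "ManufacturingProcess09", "BiologicalMaterial09")

# Classify each predictor by its name prefix and count each family
family <- ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological")
family_counts <- table(family)
family_counts
```

#Note: Given the top 10 vectors from the optimal models in Exercises 6.3 and 7.5 (not reproduced here), intersect(top10, other_top10) would list the shared predictors, where other_top10 is a placeholder name.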
#(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpart.plot(tree_model$finalModel)
#(c) Answer: The decision tree shows that ManufacturingProcess32 is the primary driver of yield, appearing at the top split. Process variables dominate the early splits, confirming their strong influence, while biological variables appear in later splits, indicating conditional effects. The tree reveals clear threshold effects and interactions between variables, providing interpretable rules about how yield changes under different conditions.
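#Note: rpart.plot() shows the mean yield in each terminal node but not its distribution. One way to view the distribution is a boxplot of the response grouped by terminal node, using the where component of the fitted rpart object. The sketch below uses a small simulated data set so that it runs on its own; the column names process and biological are made up. For the model above, the analogous inputs would presumably be tree_model$finalModel$where and train_data$Yield, assuming caret fit the final tree on the training rows in their original order.

```r
library(rpart)

# Simulated data: yield steps up when the (made-up) process variable exceeds 0.5
set.seed(123)
n <- 150
toy <- data.frame(process = runif(n), biological = runif(n))
toy$Yield <- 38 + 3 * (toy$process > 0.5) + rnorm(n, sd = 0.5)

fit <- rpart(Yield ~ ., data = toy)

# fit$where holds, for each row, the row index in fit$frame of its terminal
# node; the row names of fit$frame are the node numbers
leaf <- factor(rownames(fit$frame)[fit$where])

# Distribution of the response within each terminal node
boxplot(toy$Yield ~ leaf,
        xlab = "Terminal node",
        ylab = "Yield",
        main = "Yield distribution by terminal node")
```

#Note: Wide or overlapping boxes would indicate that a split separates the mean yield without giving tight predictions within a node.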