#8.1. Recreate the simulated data from Exercise 7.2:

# Load libraries
library(mlbench)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

# Recreate the data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"


head(simulated)

##          V1        V2         V3         V4         V5         V6        V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
##          V8         V9       V10        y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817

#(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores: Did the random forest model significantly use the uninformative predictors (V6 – V10)?

#Random Forest Model
set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)

##          Overall
## V1   8.605365900
## V2   6.831259165
## V3   0.741534943
## V4   7.883384091
## V5   2.244750293
## V6   0.136054182
## V7   0.055950944
## V8  -0.068195812
## V9   0.003196175
## V10 -0.054705900

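#Comment: No. The random forest did not significantly use V6–V10. Their importance scores (all below 0.14 in absolute value) are negligible next to those of V1, V2, V4, and V5, so the model separated most of the signal from the noise. V3 is the exception among the informative predictors: at 0.74 its score is much lower than the others, though still above every noise predictor.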
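
#Note: The scores above should be permutation-style importances: shuffle one predictor and record how much the prediction error grows. The sketch below illustrates that idea in base R on freshly simulated Friedman-type data. It is an illustration under stated assumptions, not the randomForest implementation: it swaps in a linear model (lm) for the forest, uses a simple hold-out set instead of out-of-bag samples, and the names mse and perm_imp are made up for this example.

```r
set.seed(200)
n <- 200

# Ten uniform predictors; as.data.frame() names the columns V1..V10
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))

# Friedman-type response: only V1..V5 enter the formula
y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)
dat <- cbind(x, y = y)

train <- dat[1:150, ]
test  <- dat[151:200, ]

# Stand-in model (assumption: a linear model instead of a random forest)
fit <- lm(y ~ ., data = train)

# Hold-out mean squared error for a given data frame
mse <- function(d) mean((d$y - predict(fit, d))^2)
base_mse <- mse(test)

# Permutation importance: increase in MSE after shuffling one column
perm_imp <- sapply(paste0("V", 1:10), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - base_mse
})

round(perm_imp, 3)
```

#Note: Shuffling a pure-noise column barely changes the error, and by chance it can even lower it slightly, which is why small negative scores such as those for V8 and V10 above are not alarming.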

#(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)

## [1] 0.9476651

#Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

# Fit new Random Forest Model
set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)

##                Overall
## V1          5.26426006
## V2          6.07175433
## V3          0.51040749
## V4          6.80567764
## V5          2.18127427
## V6          0.29434454
## V7         -0.01490498
## V8         -0.09461681
## V9          0.01769244
## V10        -0.07631531
## duplicate1  4.28678654

#Comment: Yes. The importance of V1 drops from 8.61 to 5.26 after adding duplicate1, which itself receives 4.29. Because duplicate1 is highly correlated with V1 (r = 0.95), trees can split on either one, so the importance that V1 held alone is now shared between the two.

# Add another correlated predictor

set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

# Fit again
set.seed(200)
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)

##                Overall
## V1          4.67634284
## V2          6.31555205
## V3          0.49765769
## V4          6.97156846
## V5          2.28059725
## V6          0.26673432
## V7         -0.02119099
## V8         -0.05351292
## V9         -0.07305751
## V10         0.05244198
## duplicate1  2.92756105
## duplicate2  2.62811786

#Comment: Yes, the importance of V1 decreases further, from 5.26 to 4.68, after adding duplicate2, and duplicate1 falls from 4.29 to 2.93. The random forest distributes importance among all three highly correlated variables (V1, duplicate1, duplicate2), so each looks less important individually even though together they carry the same information.
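
#Note: One way to check the "shared importance" reading is to add up the scores of V1 and its copies in each fit. The sketch below does this in base R with the values copied from the three printed tables above; the object names (v1_group, group_total, v1_share) are chosen only for this illustration.

```r
# Importance values copied from the printed output of model1, model2, model3
v1_group <- list(
  model1 = c(V1 = 8.605365900),
  model2 = c(V1 = 5.26426006, duplicate1 = 4.28678654),
  model3 = c(V1 = 4.67634284, duplicate1 = 2.92756105, duplicate2 = 2.62811786)
)

# Total importance held by V1 and its correlated copies in each model
group_total <- sapply(v1_group, sum)

# Fraction of that total still credited to V1 itself
v1_share <- sapply(v1_group, function(s) s[["V1"]] / sum(s))

round(rbind(group_total, v1_share), 2)
```

#Note: The group total does not shrink as copies are added, while the share credited to V1 itself falls with each new copy, which is consistent with importance being divided rather than lost.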

#(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

# Load package (install it first if it is not already available)
if (!requireNamespace("party", quietly = TRUE)) install.packages("party")

library(party)

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

# Fit conditional inference random forest
set.seed(205)

cforest_model <- cforest(
  formula = y ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10,
  data = simulated,
  controls = cforest_control(ntree = 1000, mtry = 3)
)

# View model
cforest_model

## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10 
## Number of observations:  200

# Traditional importance
varimp(cforest_model, conditional = FALSE)

##           V1           V2           V3           V4           V5           V6 
##  6.727925636  5.049566124  0.074273724  6.702425113  1.731125769 -0.001331295 
##           V7           V8           V9          V10 
##  0.086768108 -0.043730153 -0.044603667 -0.016271735

# Conditional importance (corrected for correlation)
varimp(cforest_model, conditional = TRUE)

##          V1          V2          V3          V4          V5          V6 
##  3.01288795  3.24711183  0.05342356  4.17189774  0.77684080  0.02352628 
##          V7          V8          V9         V10 
##  0.04478180 -0.02570109  0.01613121 -0.01403483

#Comment: Broadly yes for the informative versus noise split, but not in the details. Both measures rank V1, V2, and V4 highest, V5 next, and leave V3 and V6–V10 near zero. The conditional scores are smaller overall (V1 falls from 6.73 to 3.01), and the order at the top changes from V1 > V4 > V2 to V4 > V2 > V1. Note that this cforest was fit on V1–V10 only, without duplicate1 and duplicate2, so these results do not show how conditional importance handles the highly correlated copies; that would require adding them to the formula.
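
#Note: To compare the two cforest importance measures more concretely, the sketch below ranks the predictors under each and computes a rank correlation. It uses only base R and the values copied from the two varimp() outputs above; the names traditional, conditional, and rank_table are illustrative.

```r
# Values copied from varimp(cforest_model, conditional = FALSE / TRUE)
traditional <- c(V1 = 6.727925636, V2 = 5.049566124, V3 = 0.074273724,
                 V4 = 6.702425113, V5 = 1.731125769, V6 = -0.001331295,
                 V7 = 0.086768108, V8 = -0.043730153, V9 = -0.044603667,
                 V10 = -0.016271735)
conditional <- c(V1 = 3.01288795, V2 = 3.24711183, V3 = 0.05342356,
                 V4 = 4.17189774, V5 = 0.77684080, V6 = 0.02352628,
                 V7 = 0.04478180, V8 = -0.02570109, V9 = 0.01613121,
                 V10 = -0.01403483)

# Rank 1 = most important under each measure
rank_table <- data.frame(
  traditional_rank = rank(-traditional),
  conditional_rank = rank(-conditional)
)
rank_table[order(rank_table$traditional_rank), ]

# Top three predictors under each measure
top3_traditional <- names(sort(traditional, decreasing = TRUE))[1:3]
top3_conditional <- names(sort(conditional, decreasing = TRUE))[1:3]

# Agreement between the two rankings across all ten predictors
cor(traditional, conditional, method = "spearman")
```

#Note: The same three predictors lead under both measures, but in a different order, so the measures agree on which predictors matter more than on how they rank.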

#(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

#1. Boosted Trees (GBM)

# Load libraries (install Cubist first if it is not already available)
if (!requireNamespace("Cubist", quietly = TRUE)) install.packages("Cubist")

library(gbm)

## Loaded gbm 2.2.3

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(Cubist)

set.seed(200)

gbm_model <- train(
  y ~ .,
  data = simulated,
  method = "gbm",
  verbose = FALSE
)

# Variable importance (GBM)
gbm_varimp <- varImp(gbm_model, scale = FALSE)
gbm_varimp

## gbm variable importance
## 
##            Overall
## V4         4796.52
## V2         3154.03
## V1         2437.41
## V5         1968.54
## V3         1425.90
## duplicate2 1244.27
## duplicate1  981.39
## V6          173.16
## V7          139.79
## V8           89.90
## V9           71.99
## V10          69.05

#Comment: Yes, GBM shows a similar pattern. V1–V5 are the most important predictors, the noise variables (V6–V10) are least important, and the duplicates take a moderate share (1244 and 981) that pulls V1 down to third place behind V4 and V2. Unlike the random forest, GBM also gives V3 substantial importance.

#2. Cubist Model

set.seed(200)

cubist_model <- train(
  y ~ .,
  data = simulated,
  method = "cubist"
)

# Variable importance (Cubist)
cubist_varimp <- varImp(cubist_model, scale = FALSE)
cubist_varimp

## cubist variable importance
## 
##            Overall
## V1            69.5
## V2            56.5
## V4            49.0
## V3            42.5
## V5            37.0
## duplicate1     4.5
## V6             0.0
## V8             0.0
## V10            0.0
## duplicate2     0.0
## V9             0.0
## V7             0.0

#Comment: Yes, a similar overall pattern occurs, and Cubist shows the clearest separation. The true predictors (V1–V5) dominate the model, while the noise variables (V6–V10) have zero importance. Cubist is also less affected by the correlated predictors than the random forest and GBM: duplicate1 receives only 4.5, duplicate2 receives 0, and V1 remains the top predictor.

#8.2. Use a simulation to show tree bias with different granularities.

# Load necessary libraries
library(rpart)
library(rpart.plot)

set.seed(123)
n <- 1000

# 1. Create the data
# Y is a random binary outcome
y <- factor(sample(c("A", "B"), n, replace = TRUE))

# X_high: Continuous noise (high granularity)
x_high <- runif(n)

# X_low: Binary noise (low granularity)
x_low <- sample(c(0, 1), n, replace = TRUE)

df <- data.frame(y, x_high, x_low)

# 2. Run a simulation to see how often each is chosen as the first split
results <- replicate(500, {
  # Re-randomize Y every time so there is NO true relationship
  df$y <- factor(sample(c("A", "B"), n, replace = TRUE))
  
  # Fit a shallow tree
  tree <- rpart(y ~ x_high + x_low, data = df, 
                control = rpart.control(maxdepth = 1, cp = 0))
  
  # Return which variable was used for the first split
  as.character(tree$frame$var[1])
})

# 3. View the bias
table(results)

## results
## <leaf> x_high  x_low 
##    113    366     21

#Comment: This simulation demonstrates the bias of decision trees toward predictors with higher granularity. The response is a random binary label that is independent of both predictors: x_high, a continuous variable with many candidate split points, and x_low, a binary variable with a single candidate split point. Since the response is re-randomized in every run, neither predictor has a true relationship with the outcome. Across 500 depth-1 trees, x_high was chosen for the first split 366 times, x_low 21 times, and 113 trees made no split at all. Continuous variables offer many possible split points, which increases the chance that one of them reduces impurity purely by chance, so trees favor predictors with more potential split points even when no true relationship exists.
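
#Note: The mechanism can be made explicit without rpart. The base-R sketch below computes, for pure-noise data, the best Gini impurity reduction available from any threshold on a continuous predictor versus the single split available on a binary one. It is a simplified stand-in for what a tree does at the root, not rpart's actual search; gini and best_gain are hypothetical helper names, and n = 200 with 200 replications is an arbitrary choice to keep it fast.

```r
set.seed(1)

# Gini impurity of a two-class label vector
gini <- function(y) {
  p <- mean(y == "A")
  2 * p * (1 - p)
}

# Largest impurity reduction achievable by any threshold split on x
best_gain <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]          # splitting at the maximum leaves one side empty
  if (length(cuts) == 0) return(0)     # constant predictor: nothing to split on
  parent <- gini(y)
  gains <- vapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    parent - (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  }, numeric(1))
  max(gains)
}

n <- 200
gains <- replicate(200, {
  y <- sample(c("A", "B"), n, replace = TRUE)   # response unrelated to any predictor
  c(high = best_gain(runif(n), y),                          # many candidate splits
    low  = best_gain(sample(c(0, 1), n, replace = TRUE), y)) # one candidate split
})

rowMeans(gains)                          # average best gain per predictor
mean(gains["high", ] > gains["low", ])   # share of runs won by the continuous predictor
```

#Note: Taking the maximum over many candidate thresholds inflates the apparent gain of the continuous predictor, which is the selection bias the rpart simulation above records.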

# Count results (remove NA or <leaf> if present)
counts <- table(results)
counts <- counts[names(counts) != "<leaf>"]  # optional cleanup

# Bar plot
barplot(counts,
        main = "Variable Selection Frequency (Tree Bias Simulation)",
        xlab = "Predictor",
        ylab = "Number of Times Selected",
        col = "lightblue",
        border = "black")

#Comment: The bar plot shows that the continuous variable x_high is selected far more frequently than the binary variable x_low, despite neither having a true relationship with the response. This visual evidence reinforces the conclusion that decision trees are biased toward predictors with more possible split points.

#8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance.

#(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

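## The right model concentrates importance because it likely uses a larger learning rate and bagging fraction. A large learning rate lets each tree make a big, greedy correction, and a large bagging fraction means every tree sees nearly the same data, so the same few predictors are chosen again and again. The left model learns slowly on smaller random subsets, so different predictors get the chance to contribute and importance is spread more widely.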
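
## The effect can be explored with a small experiment. The sketch below fits two gbm models to simulated Friedman-type data, one with both the learning rate (shrinkage) and bagging fraction at 0.1 and one with both at 0.9, and reports how much of the total relative influence the top two predictors hold. The two settings, the sample size, the number of trees, and the helper names fit_gbm and rel_inf are assumptions made for illustration, not values taken from the exercise.

```r
library(gbm)

set.seed(200)
n <- 500

# Ten uniform predictors named V1..V10; only V1..V5 drive the response
x <- as.data.frame(matrix(runif(n * 10), ncol = 10))
x$y <- 10 * sin(pi * x$V1 * x$V2) + 20 * (x$V3 - 0.5)^2 +
  10 * x$V4 + 5 * x$V5 + rnorm(n)

# Fit a boosted model with the same value for shrinkage and bag.fraction
fit_gbm <- function(rate) {
  gbm(y ~ ., data = x, distribution = "gaussian",
      n.trees = 100, interaction.depth = 1,
      shrinkage = rate, bag.fraction = rate)
}

# Relative influence as a named vector, sorted from largest to smallest
rel_inf <- function(fit) {
  s <- summary(fit, plotit = FALSE)   # data frame with columns var and rel.inf
  setNames(s$rel.inf, as.character(s$var))
}

slow <- rel_inf(fit_gbm(0.1))
fast <- rel_inf(fit_gbm(0.9))

# Percentage of total relative influence held by the top two predictors
c(slow = sum(head(slow, 2)), fast = sum(head(fast, 2)))
```

## Changing interaction.depth in fit_gbm is one way to probe part (c) with the same setup.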

#(b) Which model do you think would be more predictive of other samples?


## The left model is typically more generalizable because it reduces over-reliance on a few predictors and spreads learning across many trees.

#(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

## Increasing interaction depth flattens the importance slope for either model because deeper trees let more predictors enter each tree through interactions and share explanatory power.

#8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(caret)
library(AppliedPredictiveModeling)
library(rpart)
library(rpart.plot)
library(gbm)
library(randomForest)
library(ipred)   # for bagging
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:party':
## 
##     where

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(123)

#Data Loading

data(ChemicalManufacturingProcess)

# Check structure
str(ChemicalManufacturingProcess)

## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

#Impute missing values & Preprocessing

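# Median imputation; note that the medians are estimated on the full data set, before the train/test split
preProc <- preProcess(ChemicalManufacturingProcess, method = "medianImpute")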

chem_data <- predict(preProc, ChemicalManufacturingProcess)

#Train/Test split

set.seed(123)
trainIndex <- createDataPartition(chem_data$Yield, p = 0.8, list = FALSE)

train_data <- chem_data[trainIndex, ]
test_data  <- chem_data[-trainIndex, ]
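
#Note: The imputation above is estimated on all 176 rows before the split, so the test rows contribute to the medians. A stricter alternative is to estimate the medians on the training rows only and apply them to both sets. The base-R sketch below shows the idea on a small made-up data frame; toy, P1, P2, and impute are hypothetical names, and this is not the preprocessing used for the results reported here.

```r
set.seed(123)

# Small made-up data set with a response and two predictors
toy <- data.frame(
  Yield = rnorm(20, mean = 40),
  P1 = rnorm(20, mean = 10),
  P2 = rnorm(20, mean = 5)
)
toy$P1[c(2, 7, 15)] <- NA
toy$P2[c(4, 18)] <- NA

# Split first
train_rows <- sample(nrow(toy), 16)
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Estimate medians from the training predictors only
predictors <- setdiff(names(toy), "Yield")
train_medians <- sapply(toy_train[predictors], median, na.rm = TRUE)

# Fill missing values in any data frame with the training medians
impute <- function(d) {
  for (p in predictors) {
    d[[p]][is.na(d[[p]])] <- train_medians[[p]]
  }
  d
}

toy_train <- impute(toy_train)
toy_test  <- impute(toy_test)

colSums(is.na(rbind(toy_train, toy_test)))   # remaining missing values per column
```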

#Train multiple tree-based models

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

#Single Tree

set.seed(123)
tree_model <- train(
  Yield ~ ., data = train_data,
  method = "rpart",
  trControl = ctrl,
  tuneLength = 10
)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

#Bagging

set.seed(123)
bag_model <- train(
  Yield ~ ., data = train_data,
  method = "treebag",
  trControl = ctrl
)

#Random Forest

set.seed(123)
rf_model <- train(
  Yield ~ ., data = train_data,
  method = "rf",
  trControl = ctrl,
  tuneLength = 5
)

#GBM

set.seed(123)
gbm_model <- train(
  Yield ~ ., data = train_data,
  method = "gbm",
  trControl = ctrl,
  tuneLength = 5,
  verbose = FALSE
)

#Compare Models

resamps <- resamples(list(
  Tree = tree_model,
  Bagging = bag_model,
  RF = rf_model,
  GBM = gbm_model
))

summary(resamps)

## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Tree, Bagging, RF, GBM 
## Number of resamples: 15 
## 
## MAE 
##              Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## Tree    0.7745962 0.9091955 1.0360495 1.0378669 1.1217882 1.367055    0
## Bagging 0.6193870 0.8392414 0.8631295 0.9468185 1.0510798 1.361668    0
## RF      0.5974634 0.7882993 0.9159233 0.8968593 0.9640028 1.236529    0
## GBM     0.6546623 0.8814852 0.9289042 0.9371222 1.0147650 1.238478    0
## 
## RMSE 
##              Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Tree    1.1244019 1.265769 1.347449 1.373317 1.517252 1.813484    0
## Bagging 0.7783261 1.110260 1.233130 1.255362 1.338752 1.807642    0
## RF      0.7051958 1.031407 1.145138 1.168946 1.250480 1.641251    0
## GBM     0.7734493 1.121092 1.195685 1.206292 1.297348 1.593903    0
## 
## Rsquared 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Tree    0.3459136 0.3913288 0.4932775 0.4853639 0.5588593 0.7193093    0
## Bagging 0.3298106 0.5001703 0.5669150 0.5491661 0.5965383 0.7787947    0
## RF      0.4454872 0.5844404 0.6261969 0.6241108 0.6608936 0.8092395    0
## GBM     0.3668440 0.5589072 0.5962755 0.5993646 0.6606197 0.7868107    0

dotplot(resamps)

#Test set performance

models <- list(tree_model, bag_model, rf_model, gbm_model)
names(models) <- c("Tree", "Bagging", "RF", "GBM")

test_results <- lapply(models, function(mod) {
  preds <- predict(mod, test_data)
  postResample(preds, test_data$Yield)
})

test_results

## $Tree
##      RMSE  Rsquared       MAE 
## 1.8552995 0.1699285 1.3315952 
## 
## $Bagging
##     RMSE Rsquared      MAE 
## 1.397220 0.419397 1.034465 
## 
## $RF
##      RMSE  Rsquared       MAE 
## 1.2760958 0.5320198 0.9841074 
## 
## $GBM
##      RMSE  Rsquared       MAE 
## 1.2520131 0.5467897 0.9738349

#(a) Which tree-based regression model gives the optimal resampling and test set performance?

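#(a)- Answer: The two criteria point to different models. In resampling, Random Forest performed best (mean RMSE 1.169, mean R² 0.624), followed by GBM (1.206, 0.599). On the test set, GBM was marginally best (RMSE 1.252, R² 0.547), with Random Forest close behind (RMSE 1.276, R² 0.532). The gap between the two is small relative to the resampling variability, so they are effectively tied, while bagging and especially the single tree performed substantially worse on both.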
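
#Note: Putting the resampling and test-set RMSE side by side makes the comparison easier to read. The sketch below uses base R and the values copied from summary(resamps) (mean RMSE) and test_results above; perf, best_resample, and best_test are illustrative names.

```r
# Mean resampled RMSE and test-set RMSE copied from the printed output
perf <- data.frame(
  model = c("Tree", "Bagging", "RF", "GBM"),
  resample_rmse = c(1.373317, 1.255362, 1.168946, 1.206292),
  test_rmse = c(1.8552995, 1.397220, 1.2760958, 1.2520131)
)

# How much worse each model does on the test set than in resampling
perf$gap <- perf$test_rmse - perf$resample_rmse

perf[order(perf$resample_rmse), ]

best_resample <- perf$model[which.min(perf$resample_rmse)]
best_test <- perf$model[which.min(perf$test_rmse)]
c(resampling = best_resample, test = best_test)
```

#Note: Every model does worse on the test set than in resampling, and the single tree degrades the most.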

#(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(gbm_model)

## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31  24.019
## BiologicalMaterial06    19.152
## ManufacturingProcess13  17.969
## ManufacturingProcess17  17.462
## BiologicalMaterial03    15.418
## BiologicalMaterial12    14.296
## ManufacturingProcess06  12.745
## ManufacturingProcess09  12.138
## BiologicalMaterial09     8.787
## ManufacturingProcess25   6.257
## ManufacturingProcess01   6.208
## ManufacturingProcess37   5.760
## ManufacturingProcess27   4.957
## BiologicalMaterial10     3.847
## ManufacturingProcess21   3.667
## ManufacturingProcess39   3.622
## ManufacturingProcess36   2.644
## ManufacturingProcess03   2.438
## ManufacturingProcess44   2.404

plot(varImp(gbm_model), top = 20)

#(b) - Answer: ManufacturingProcess32 is by far the most important predictor; with importance scaled to 100, the runner-up (ManufacturingProcess31) scores only 24. The rest of the top 10 is ManufacturingProcess13, 17, 06, and 09 together with BiologicalMaterial06, 03, 12, and 09, so process variables hold six of the top ten places, including the top two, while biological variables hold four. Process variables therefore dominate, though less overwhelmingly than the top of the list suggests. Compared to earlier models, GBM shows some overlap but also highlights additional predictors, which may reflect its ability to capture nonlinearities and interactions.
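
#Note: The split between process and biological variables in the top 10 can be counted directly from the predictor names. The sketch below uses base R with the top ten names copied from the varImp(gbm_model) output above; top10 and group_counts are illustrative names.

```r
# Top ten predictors in the order printed by varImp(gbm_model)
top10 <- c("ManufacturingProcess32", "ManufacturingProcess31",
           "BiologicalMaterial06", "ManufacturingProcess13",
           "ManufacturingProcess17", "BiologicalMaterial03",
           "BiologicalMaterial12", "ManufacturingProcess06",
           "ManufacturingProcess09", "BiologicalMaterial09")

# Classify each predictor by its name prefix
group <- ifelse(grepl("^Manufacturing", top10), "Process", "Biological")

group_counts <- table(group)
group_counts
```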

#(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

rpart.plot(tree_model$finalModel)

#(c) Answer: The decision tree shows that ManufacturingProcess32 is the primary driver of yield, appearing at the top split. Process variables dominate the early splits, confirming their strong influence, while biological variables appear in later splits, indicating conditional effects. The tree reveals clear threshold effects and interactions between variables, providing interpretable rules about how yield changes under different conditions.

Data 624: Home Work 9

Mohammad Zahid Chowdhury

2026-05-03
