Data 624: HW 9

Nakesha Fray

2025-04-15

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
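
The chunks below assume the following packages are loaded (they would normally sit in a setup chunk that is not echoed); this list is inferred from the functions used throughout:

library(mlbench)                    # mlbench.friedman1
library(randomForest)               # randomForest
library(caret)                      # varImp, train, createDataPartition, preProcess, nearZeroVar, postResample
library(dplyr)                      # arrange, desc
library(party)                      # cforest, varimp
library(gbm)                        # gbm
library(Cubist)                     # cubist
library(ipred)                      # ipredbagg
library(rpart)                      # rpart
library(rpart.plot)                 # rpart.plot
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess, solubility data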

8.1. Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

set.seed(200)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |>
  arrange(desc(Overall))
##          Overall
## V1   8.605365900
## V4   7.883384091
## V2   6.831259165
## V5   2.244750293
## V3   0.741534943
## V6   0.136054182
## V7   0.055950944
## V9   0.003196175
## V10 -0.054705900
## V8  -0.068195812

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The random forest model ranked V1 as the most important variable (importance score 8.61), followed by V4 (7.88) and V2 (6.83). The top five predictors were V1 through V5, though not in that order, meaning they had the most influence on y. The model did not significantly use the uninformative predictors (V6 – V10): their importance scores were near zero or negative, so they contributed little to predicting y. This is the behavior we want, since the model is essentially down-weighting the noise predictors.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

When a new predictor that is highly correlated with V1 is added to the model, the importance score for V1 decreases. In the updated random forest model, the most important variable is now V4 with an importance score of 6.80, followed by V2 (6.20), V1 (6.05), and the new correlated variable, duplicate1 (4.21). This demonstrates that adding a highly correlated predictor dilutes the importance of V1, as the model splits the predictive contribution between V1 and its duplicate.

set.seed(200)
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
  arrange(desc(Overall))
##                Overall
## V4          6.79689905
## V2          6.19650787
## V1          6.05096011
## duplicate1  4.20641786
## V5          2.26790879
## V3          0.52249125
## V6          0.19172138
## V7          0.03766832
## V9          0.02041124
## V10        -0.04481192
## V8         -0.08406511
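
The second part of the question asks what happens when yet another predictor highly correlated with V1 is added. A minimal sketch of that step (duplicate2 is a name introduced here for illustration, and it is dropped afterwards so the later chunks reproduce the output shown): the importance credited to V1 should be diluted further, now split three ways.

set.seed(200)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1

model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
varImp(model3, scale = FALSE) |>
  arrange(desc(Overall))

# drop the extra copy so the chunks below match the results shown
simulated$duplicate2 <- NULL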

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

The variable importance scores from the conditional inference forest (using cforest) do not show the exact same pattern as those from the traditional random forest model, though they are somewhat similar. In the traditional random forest, the most important variables were V4, V2, V1, and duplicate1, with importance scores of 6.797, 6.197, 6.051, and 4.206, respectively. As seen in the previous question, this suggests that the importance of V1 was diluted due to its high correlation with duplicate1. On the other hand, the conditional inference forest (with conditional = TRUE) also identified V4, V2, V1, and duplicate1 as the top predictors — but with lower importance scores: 5.971, 4.809, 3.991, and 2.300, respectively. Also, while the top scores in the traditional model were relatively close together, the scores in the conditional model are more spaced out, which shows clearer distinctions in how each predictor influences the model. This difference clearly shows that setting conditional = TRUE in the varimp() function adjusts for multicollinearity, which provides a more accurate estimate of each predictor’s contribution compared to the traditional random forest model.

set.seed(200)

cforest_model <- cforest(y~., data=simulated)

imp_conditional <- varimp(cforest_model, conditional = TRUE)
sort(imp_conditional, decreasing = TRUE)
##          V4          V2          V1  duplicate1          V5          V3 
##  5.97100449  4.80864960  3.99082208  2.30033386  1.59907090 -0.01130152 
##         V10          V8          V7          V6          V9 
## -0.13144492 -0.16024120 -0.16318684 -0.27500439 -0.32266033
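
For a side-by-side look at what the conditional argument changes, the traditional (unconditional) permutation importance can be pulled from the same cforest fit; under the default conditional = FALSE, V1 and duplicate1 would be expected to score relatively higher, since their shared information is not adjusted for.

sort(varimp(cforest_model, conditional = FALSE), decreasing = TRUE)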

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The boosted trees and Cubist models show somewhat different patterns of predictor importance compared to both the traditional and conditional random forest models. As seen previously, the traditional random forest diluted the importance of V1 and duplicate1 because of their high correlation, and the conditional inference forest (cforest with conditional = TRUE) adjusted for this, resulting in lower importance scores for both. The boosted trees model (gbm) still assigned meaningful importance to both V1 and duplicate1, suggesting some sensitivity to multicollinearity, though it may have handled it slightly better than the traditional random forest. Finally, the Cubist model did not use duplicate1 at all, instead assigning 100% usage to the key informative predictors (V1, V2, V4, and V5). This suggests that Cubist may be more robust to multicollinearity, effectively ignoring redundant variables.

#boosted trees
boost <- gbm(y ~ ., data = simulated, distribution = "gaussian")
             
boost_imp <- summary(boost, plotit = FALSE)
boost_imp
##                   var    rel.inf
## V4                 V4 30.1741369
## V2                 V2 21.1581888
## V1                 V1 20.2592091
## V5                 V5 12.2255615
## V3                 V3  8.5876086
## duplicate1 duplicate1  6.9970857
## V6                 V6  0.4440209
## V7                 V7  0.1541886
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000
#cubist
cubist_model <- cubist(x = simulated[, -which(names(simulated) == "y")],
                       y = simulated$y)

summary(cubist_model)
## 
## Call:
## cubist.default(x = simulated[, -which(names(simulated) == "y")], y
##  = simulated$y)
## 
## 
## Cubist [Release 2.07 GPL Edition]  Sun Apr 27 19:19:33 2025
## ---------------------------------
## 
##     Target attribute `outcome'
## 
## Read 200 cases (12 attributes) from undefined.data
## 
## Model:
## 
##   Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
## 
##  outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
## 
## 
## Evaluation on training data (200 cases):
## 
##     Average  |error|           2.125882
##     Relative |error|               0.53
##     Correlation coefficient        0.85
## 
## 
##  Attribute usage:
##    Conds  Model
## 
##           100%    V1
##           100%    V2
##           100%    V4
##           100%    V5
## 
## 
## Time: 0.0 secs
cubist_model$usage
##    Conditions Model   Variable
## 1           0   100         V1
## 2           0   100         V2
## 3           0   100         V4
## 4           0   100         V5
## 5           0     0         V3
## 6           0     0         V6
## 7           0     0         V7
## 8           0     0         V8
## 9           0     0         V9
## 10          0     0        V10
## 11          0     0 duplicate1

8.2. Use a simulation to show tree bias with different granularities.

This simulation illustrates how the bias of a decision tree changes with the granularity of the fit. The cross-validation plot shows that very granular, complex trees tend to fit the training data too closely, leading to overfitting, while very simple, coarse trees may underfit by failing to capture the data's patterns. Tuning the complexity parameter balances these extremes and helps avoid both underfitting and overfitting. (A complementary check of granularity at the predictor level, namely trees favoring variables with many distinct values, is sketched after the tuning plot below.)

set.seed(1000)
n <- 1000  
a <- rnorm(n)  
b <- rnorm(n)  
c <- rnorm(n)
d <- rnorm(n)
y <- 3*a - 2*b + 1*c + 0.5*d + rnorm(n)
trainData <- data.frame(a, b, c, d, y)

trainIndex <- createDataPartition(trainData$y, p = 0.8, list = FALSE)
testData  <- trainData[-trainIndex, ]  # hold out the test rows before overwriting trainData
trainData <- trainData[trainIndex, ]

rpartTune <- train(
  y ~ ., 
  data = trainData,
  method = "rpart", 
  tuneLength = 10,  
  trControl = trainControl(method = "cv")  
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(rpartTune)
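
As a more direct check of the granularity that Exercise 8.2 refers to (the selection bias of trees toward predictors with many distinct values), the sketch below repeatedly fits a one-split tree to pure noise, where no predictor is related to y, and counts which predictor is chosen at the root. The predictor names, seed, and simulation sizes are illustrative assumptions; the expectation is that the most finely grained predictor is selected most often even though all three are uninformative.

set.seed(624)
n_sim  <- 200
counts <- c(coarse = 0, medium = 0, fine = 0)

for (i in seq_len(n_sim)) {
  dat <- data.frame(
    coarse = sample(1:2,  100, replace = TRUE),  # 2 distinct values
    medium = sample(1:10, 100, replace = TRUE),  # 10 distinct values
    fine   = runif(100),                         # ~100 distinct values
    y      = rnorm(100)                          # unrelated to every predictor
  )
  fit   <- rpart(y ~ ., data = dat,
                 control = rpart.control(cp = 0, maxdepth = 1))
  first <- as.character(fit$frame$var[1])        # variable used at the root split
  if (first %in% names(counts)) counts[first] <- counts[first] + 1
}

counts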

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right, with a learning rate and bagging fraction of 0.9, focuses its importance on just a few predictors because it trains quickly and aggressively. Since higher values of these parameters reduce the number of trees needed, the model quickly locks onto the strongest, most important predictors and uses them repeatedly. This causes the top predictors to dominate the importance scores while weaker predictors are barely used. The model on the left, with a learning rate and bagging fraction of 0.1, learns more slowly and less aggressively because both parameters are much smaller. It requires more trees to reach a good model and therefore spreads importance across a wider range of predictors, even those with weaker individual effects.
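
A rough way to see this empirically is to fit gbm with the two extreme settings and compare how quickly the relative influence falls off. This is a sketch rather than a reproduction of Fig. 8.24: it assumes the solubility data shipped with AppliedPredictiveModeling, and the seed and n.trees value are arbitrary choices.

data(solubility)   # provides solTrainXtrans and solTrainY, among others
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)

# relative influence should be spread out for the 0.1/0.1 fit and
# concentrated on the first few predictors for the 0.9/0.9 fit
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)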

(b) Which model do you think would be more predictive of other samples?

Since Ridgeway (2007) suggests that smaller values of the learning rate (such as less than 0.01) tend to produce better generalization, the model on the left, which uses a lower learning rate and bagging fraction (both 0.1), is more likely to be predictive on new, unseen samples. A smaller learning rate means the model learns more gradually and requires more trees to make progress, but this also helps it avoid overfitting to noisy patterns in the data. On the other hand, the model on the right, with a high learning rate and high bagging fraction (both 0.9), trains much more aggressively and builds fewer trees overall; while this might work well on the training data, it risks overfitting and potentially missing other useful patterns. Therefore, the model on the left, while slower to train, most likely builds a more cautious model that will perform better on new data.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth increases the depth of the trees, which allows the model to capture more complex relationships between predictors. This would affect the slope of predictor importance differently for the two models in Fig. 8.24. For the model on the left (learning rate and bagging fraction of 0.1), increasing the interaction depth would make the model more flexible and better at finding patterns it may have missed, so importance would be spread across even more predictors as the model explores additional relationships. For the model on the right (learning rate and bagging fraction of 0.9), increasing the interaction depth would likely cause it to focus even more on the same few top predictors, since the model is aggressive and converges quickly; deeper trees would keep the risk of overfitting, and the importance scores would be even more concentrated at the top. In other words, increasing the interaction depth makes the model more flexible by letting it examine more predictor relationships, but how the importance profile changes depends on the model's other tuning parameters.

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data(ChemicalManufacturingProcess)
impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")
bio <- predict(impute, ChemicalManufacturingProcess)

filtered_bio <- bio[, -nearZeroVar(bio)]

set.seed(1000)
splits2 <- createDataPartition(filtered_bio$Yield, p = .80, times = 1, list = FALSE)

training <- filtered_bio[splits2, ]
testing  <- filtered_bio[-splits2, ]

training_x <- training[, names(training) != "Yield"]
training_y <- training$Yield

test_x <- testing[, names(testing) != "Yield"]
test_y <- testing$Yield

Single Tree Model: The lowest RMSE for the model was found at cp = 0.01032311, which is the optimal complexity parameter determined during cross-validation. The final RMSE on the test set was 0.9256104, indicating that the model’s predictions are off by about this amount on average. The R-squared value of 0.3123065 means that the model explains about 31.23% of the variance in Yield. The MAE was 0.6881193, indicating that, on average, the predictions differ from the true values by approximately 0.69 units. Given these results, we can conclude that the model is not fully capturing the variability in the data. The relatively low R-squared and the RMSE suggest that the single decision tree may not be the most effective model for predicting the Yield variable in this dataset.

set.seed(200)
rpartTune <- train(training_x, training_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpartTune
## CART 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.01032311  0.7617520  0.4846094  0.6036067
##   0.01345907  0.7644723  0.4838346  0.6090378
##   0.01438438  0.7721981  0.4730818  0.6146393
##   0.02472886  0.7816848  0.4619764  0.6187762
##   0.02593820  0.7751396  0.4702799  0.6118116
##   0.03243377  0.7699273  0.4822457  0.6082320
##   0.05263714  0.7721189  0.4853148  0.6068205
##   0.07775048  0.7804466  0.4539438  0.6097480
##   0.09462110  0.8043232  0.4032911  0.6249575
##   0.42767145  0.9386815  0.2878156  0.7426912
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01032311.
rpartPred <- predict(rpartTune, newdata = test_x)
postResample(pred = rpartPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.9256104 0.3123065 0.6881193

Bagged Tree Model: The lowest RMSE was 0.7365350, indicating that, on average, the model’s predictions are off by approximately 0.7365 units. The R-squared value of 0.4619277 means that the model explains about 46.19% of the variance in the Yield. The MAE was 0.5814897, which means that, on average, the predictions differ from the true values by about 0.58 units. Given these results, we can conclude that the bagged tree model performs better than the single decision tree model. The lower RMSE and higher R-squared suggest that the bagging technique has improved the model’s predictive accuracy. However, while the bagged model improves on the single tree, the R-squared value is still relatively low, and the RMSE is still somewhat high, suggesting that the model does not fully capture the variability in the data. Therefore, while the bagged tree is better than the single tree, it may not be the most effective model for predicting Yield in this dataset.

set.seed(200)
baggedTree <- ipredbagg(training_y, training_x)

baggedTree
## 
## Bagging regression trees with 25 bootstrap replications
baggPred <- predict(baggedTree, newdata = test_x)
postResample(pred = baggPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.7365350 0.4619277 0.5814897
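
Because ipredbagg() does not report a resampling estimate, a cross-validated version of the same bagged model could also be fit through caret for a fairer comparison with the other train() fits. A minimal sketch, where baggedCV is a name introduced here and method = "treebag" wraps the same ipred bagging:

set.seed(200)
baggedCV <- train(training_x, training_y,
                  method = "treebag",
                  trControl = trainControl(method = "cv"))
baggedCV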

Random Forest: The optimal mtry value for the model was 56, and the final RMSE on the test set was 0.7161877, indicating that the model’s predictions are off by about 0.716 units on average. The R-squared value of 0.4890676 means that the model explains approximately 48.91% of the variance in the Yield variable. The MAE was 0.5491009, indicating that, on average, the model’s predictions differ from the true values by about 0.549 units. Compared to the bagged tree model, the random forest model has a slightly lower RMSE and slightly higher R-squared, suggesting better predictive performance than both the single tree and bagged tree models. However, the R-squared is still relatively low, meaning the model is not fully capturing the variability in the data. As a result, it may not be the most effective model for predicting the Yield variable in this dataset.

set.seed(200)
rforest <- train(training_x, training_y,
                  method = "rf",
                  tuneLength = 5,
                  trControl = trainControl(method = "cv"))

rforest
## Random Forest 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.6661915  0.6690380  0.5320680
##   15    0.5882910  0.7085044  0.4615041
##   29    0.5768726  0.7143063  0.4444740
##   42    0.5763108  0.7114132  0.4442758
##   56    0.5744258  0.7110812  0.4408374
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 56.
rfPred <- predict(rforest, newdata = test_x)
postResample(pred = rfPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.7161877 0.4890676 0.5491009

Boosted tree: The final model selected used n.trees = 250, interaction.depth = 2, shrinkage = 0.1, and n.minobsinnode = 10. The final RMSE on the test set was 0.7013506, indicating that the model’s predictions are off by this amount on average. The R-squared value of 0.5841578 means that approximately 58.42% of the variance in the Yield variable is explained by the model. The MAE was 0.5706618, indicating that, on average, the model’s predictions differ from the true values by about 0.571 units. Compared to the single tree, bagged tree, and random forest models, the boosted model has a lower RMSE and a higher R-squared, which suggests better predictive accuracy. However, the R-squared is still only moderate, meaning the model does not fully capture the variability in the data, so it may not be the most effective model for predicting the Yield variable in this dataset.

set.seed(200)

boosted <- train(training_x, training_y,
                   method = "gbm",
                   verbose = FALSE,
                   tuneLength = 5,
                   trControl = trainControl(method = "cv"))

boosted
## Stochastic Gradient Boosting 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   1                   50      0.6153562  0.6532252  0.4805991
##   1                  100      0.5999571  0.6698003  0.4661477
##   1                  150      0.5977469  0.6717799  0.4616404
##   1                  200      0.5972810  0.6708698  0.4607287
##   1                  250      0.5972817  0.6726907  0.4597517
##   2                   50      0.6003909  0.6800836  0.4673940
##   2                  100      0.5851562  0.7003707  0.4488490
##   2                  150      0.5782965  0.7029438  0.4483112
##   2                  200      0.5767689  0.7019668  0.4467679
##   2                  250      0.5740112  0.7011128  0.4436661
##   3                   50      0.6072312  0.6663113  0.4727836
##   3                  100      0.5933423  0.6805577  0.4666605
##   3                  150      0.5916772  0.6825415  0.4659560
##   3                  200      0.5900410  0.6838075  0.4618457
##   3                  250      0.5827145  0.6911719  0.4562658
##   4                   50      0.5999634  0.6759866  0.4534527
##   4                  100      0.5855927  0.6889601  0.4428903
##   4                  150      0.5810369  0.6900550  0.4386034
##   4                  200      0.5821684  0.6884864  0.4396687
##   4                  250      0.5792359  0.6908354  0.4384591
##   5                   50      0.6042685  0.6702496  0.4713805
##   5                  100      0.6000414  0.6758406  0.4629567
##   5                  150      0.5942332  0.6825467  0.4555361
##   5                  200      0.5893463  0.6854017  0.4525172
##   5                  250      0.5890580  0.6856612  0.4538701
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.
boostPred <- predict(boosted, newdata = test_x)
postResample(pred = boostPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.7013506 0.5841578 0.5706618

Cubist Model: The best resampling performance was obtained with committees = 20 and neighbors = 5. The final RMSE on the test set was 0.5791355, which means the model’s predictions are off by about this amount on average. The R-squared value of 0.6954971 indicates that the model explains about 69.55% of the variance in the Yield variable, showing a relatively strong relationship between the predictors and the target variable. The MAE was 0.4495403, meaning that, on average, the predictions differ from the true values by approximately 0.45 units. Compared to the previous models, the Cubist model demonstrates the best predictive performance of the models tested, with the lowest RMSE and the highest R-squared. The R-squared value indicates that the model captures a reasonably large portion of the variability in the data, and the low RMSE suggests that it is making relatively accurate predictions.

set.seed(200)
cubist_model <- train(Yield ~ .,
                    data = training,
                    method = "cubist")

cubist_model
## Cubist 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    1          0          0.9243937  0.3858279  0.6548392
##    1          5          0.9099438  0.4039700  0.6394167
##    1          9          0.9137386  0.3977330  0.6429047
##   10          0          0.6665183  0.5685403  0.5131718
##   10          5          0.6550573  0.5848772  0.5002271
##   10          9          0.6604053  0.5774764  0.5053454
##   20          0          0.6379770  0.6006395  0.4925007
##   20          5          0.6262964  0.6165856  0.4797210
##   20          9          0.6322284  0.6089168  0.4853224
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
cubePred <- predict(cubist_model, newdata = test_x)
postResample(pred = cubePred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.5791355 0.6954971 0.4495403

(a) Which tree-based regression model gives the optimal resampling and test set performance?

For this question, I tested the single tree, bagged tree, random forest, boosted tree, and Cubist models. Based on these results, the Cubist model gives the best overall performance: it has the lowest test-set RMSE (0.5791) and the highest R-squared (0.6955) among all the tree-based models, so it explains the most variance in Yield and provides the most accurate predictions, making it the best model for both resampling and test set performance. For reference, these were the test-set results from the other tree-based regression models:
Single Tree: RMSE = 0.9256, R-squared = 0.3123
Bagged Trees: RMSE = 0.7365, R-squared = 0.4619
Random Forest: RMSE = 0.7162, R-squared = 0.4891
Boosted Trees: RMSE = 0.7014, R-squared = 0.5842
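
A quick way to collect those test-set metrics in one table, as a sketch that reuses the prediction objects computed in the chunks above:

rbind(
  SingleTree   = postResample(rpartPred, test_y),
  BaggedTree   = postResample(baggPred,  test_y),
  RandomForest = postResample(rfPred,    test_y),
  Boosted      = postResample(boostPred, test_y),
  Cubist       = postResample(cubePred,  test_y)
)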

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The most important predictors in the non-linear SVM model that I have trained are:
ManufacturingProcess32
BiologicalMaterial06
ManufacturingProcess36
ManufacturingProcess13
BiologicalMaterial03
BiologicalMaterial12
ManufacturingProcess17
BiologicalMaterial02
ManufacturingProcess31
ManufacturingProcess09

The most important predictors from the optimal linear model were:
ManufacturingProcess32
ManufacturingProcess36
ManufacturingProcess13
ManufacturingProcess09
ManufacturingProcess17
ManufacturingProcess06
BiologicalMaterial02
BiologicalMaterial06
ManufacturingProcess11
ManufacturingProcess33

The most important predictors from the Cubist tree-based regression model were:
ManufacturingProcess32
ManufacturingProcess17
ManufacturingProcess09
BiologicalMaterial03
ManufacturingProcess29
BiologicalMaterial06
ManufacturingProcess13
ManufacturingProcess27
BiologicalMaterial02
BiologicalMaterial12

The process predictors dominate the list compared to the biological predictors for the non-linear model, the linear model, and the tree-based regression model. The key process predictors that appear prominently across the models, with a strong influence on Yield, are ManufacturingProcess32, ManufacturingProcess36, ManufacturingProcess13, and ManufacturingProcess17. Among the biological predictors, BiologicalMaterial06 and BiologicalMaterial03 ranked high in the SVM and Cubist models, while BiologicalMaterial02 appeared in both the linear and Cubist models. The Cubist model also included predictors that were not in the other models, such as ManufacturingProcess29 and ManufacturingProcess27.

varImp(cubist_model)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  79.798
## ManufacturingProcess09  36.364
## BiologicalMaterial03    35.354
## ManufacturingProcess29  31.313
## BiologicalMaterial06    29.293
## ManufacturingProcess13  29.293
## ManufacturingProcess27  27.273
## BiologicalMaterial02    27.273
## BiologicalMaterial12    26.263
## ManufacturingProcess26  22.222
## ManufacturingProcess25  22.222
## ManufacturingProcess39  21.212
## ManufacturingProcess33  16.162
## ManufacturingProcess31  15.152
## ManufacturingProcess15  13.131
## ManufacturingProcess11  13.131
## ManufacturingProcess18  11.111
## ManufacturingProcess36  11.111
## ManufacturingProcess28   9.091
ggplot(varImp(cubist_model), top = 15)

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The tree structure shows how the predictor variables are split at each node, and the terminal nodes display the average value of Yield for the observations that fall into them. By looking at the distribution of Yield in these terminal nodes, we can see how different combinations of predictors relate to Yield. This view does provide additional insight into the relationship between the biological and process predictors and Yield: it illustrates how certain combinations of predictors, whether biological or process-related, work together to influence Yield. Because Yield was centered and scaled during pre-processing, negative node values correspond to below-average yield and positive values to above-average yield, so the splits highlight which predictor ranges lead to better or worse outcomes.

# Fit the optimal single tree
singleTree <- rpart(Yield ~ ., data = training, method = "anova", cp = 0.01032311)

# Plot the tree
rpart.plot(singleTree)

# Mean (scaled) Yield at each node of the tree (internal and terminal nodes)
singleTree$frame$yval
##  [1] -0.0003044259 -0.5188073655 -1.0089842717 -1.7458454910 -0.8247689669
##  [6] -1.2585069981 -0.6512737543 -0.2011001115 -0.5394674485 -0.7450068359
## [11] -0.2654149320  0.4222081408  0.8387276036  0.6199373903  0.4241157830
## [16]  0.2247586799  1.0506666786  1.0931729411  1.4794703713
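
To isolate the terminal nodes specifically, a small sketch using the fact that an rpart frame marks terminal nodes with var == "<leaf>":

leaf_means <- singleTree$frame$yval[singleTree$frame$var == "<leaf>"]
sort(leaf_means)   # mean scaled Yield in each terminal node, low to high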