library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(GGally)
library(mlbench)
library(Cubist)
library(gbm)
library(party)
library(partykit)
library(RWeka)
library(rpart)
library(randomForest)
library(janitor)
# Set seed
set.seed(200)
Homework 9
8.1
Recreate the simulated data from Exercise 7.2:
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
- Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE) |>
arrange(-Overall)
rfImp1
                Overall
V1 8.732235404
V4 7.615118809
V2 6.415369387
V5 2.023524577
V3 0.763591825
V6 0.165111172
V7 -0.005961659
V10 -0.074944788
V9 -0.095292651
V8 -0.166362581
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. V6-V10 sit at the bottom of the importance list with scores near zero (all below 0.2, and several negative), so the random forest made essentially no use of the uninformative predictors.
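As a supplementary check (not produced in the original assignment), randomForest's varUsed() counts how many times each predictor is chosen for a split across the forest; the uninformative predictors should be selected far less often than V1-V5.
# Supplementary check: split counts per predictor across the forest
splitCounts <- varUsed(model1, count = TRUE)
names(splitCounts) <- rownames(model1$importance)
sort(splitCounts, decreasing = TRUE)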
- Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE) |> arrange(-Overall)
rfImp2
               Overall
V4 7.04752238
V2 6.06896061
V1 5.69119973
duplicate1 4.28331581
V5 1.87238438
V3 0.62970218
V6 0.13569065
V10 0.02894814
V9 0.00840438
V7 -0.01345645
V8 -0.04370565
# Add another predictor that is also highly correlated with V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE) |> arrange(-Overall)
rfImp3
               Overall
V4 7.04870917
V2 6.52816504
V1 4.91687329
duplicate1 3.80068234
V5 2.03115561
duplicate2 1.87721959
V3 0.58711552
V6 0.14213148
V7 0.10991985
V10 0.09230576
V9 -0.01075028
V8 -0.08405687
# Pull importance for V1 under each model.
rfImp1_V1 <- rfImp1["V1", "Overall"]
rfImp2_V1 <- rfImp2["V1", "Overall"]
rfImp3_V1 <- rfImp3["V1", "Overall"]
Yes, the importance score for V1 changes after each highly correlated predictor is added: it drops from about 8.73 in the original model to about 5.69 with one duplicate and about 4.92 with two, because the importance is split among the correlated copies.
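A supplementary check (values not printed above) suggests the drop is a redistribution rather than a loss of signal: the combined importance of V1 and its correlated copies stays roughly comparable to V1's importance in the original model.
# Supplementary check: importance of V1 alone vs. V1 plus its duplicates
rfImp1["V1", "Overall"]
sum(rfImp2[c("V1", "duplicate1"), "Overall"])
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])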
- Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varimp(model4, conditional = TRUE) |>
  as.data.frame() |>
  clean_names() |>
  arrange(desc(varimp_model4_conditional_true))
rfImp4
           varimp_model4_conditional_true
V4 5.080528995
V2 4.948395426
V1 2.469782364
duplicate1 1.996602485
V5 1.327443662
duplicate2 0.684406699
V7 -0.002487935
V3 -0.016746065
V6 -0.163556406
V10 -0.164152112
V9 -0.223192079
V8 -0.302631078
Yes. The conditional importances show the same general pattern as the traditional random forest: V4 and V2 lead, V1 shares credit with its duplicates, and V6-V10 sit near zero at the bottom of the list.
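For comparison (not shown in the original output), the traditional, unconditional importance can be pulled from the same cforest fit; conditional = TRUE is the Strobl et al. (2007) adjustment that down-weights predictors correlated with others, such as V1 and its duplicates.
# Unconditional permutation importance from the same cforest fit
varimp(model4, conditional = FALSE) |>
  sort(decreasing = TRUE)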
- Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Fit boosted trees
model5 <- gbm(y ~ ., data = simulated, distribution = "gaussian")
gbmImp <- summary(model5) # Plot importance scores
gbmImp # Print importance scores
                      var    rel.inf
V4 V4 29.9457890
V2 V2 22.4410503
duplicate1 duplicate1 16.2795388
V5 V5 11.7083538
V3 V3 8.6372261
V1 V1 6.0343438
duplicate2 duplicate2 4.5584868
V6 V6 0.3952114
V7 V7 0.0000000
V8 V8 0.0000000
V9 V9 0.0000000
V10 V10 0.0000000
# Fit Cubist
model6 <- cubist(
  x = simulated[, !names(simulated) %in% "y"],
  y = simulated$y
)
cubImp <- varImp(model6, scale = FALSE) |>
  arrange(-Overall)
cubImp
           Overall
V1 50
V2 50
V4 50
V5 50
duplicate1 50
V3 0
V6 0
V7 0
V8 0
V9 0
V10 0
duplicate2 0
The boosted tree and Cubist models show the same general pattern, with V6-V10 among the least important predictors. The correlated copies of V1 again dilute its ranking: the boosted tree shifts much of V1's influence onto duplicate1, while Cubist keeps V1 and duplicate1 tied at the top of its usage-based scores but assigns duplicate2 no importance at all.
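As a quick follow-up (not part of the original output), it may be worth checking whether the Cubist pattern holds when committees are used, since committee models can pull in predictors that a single rule set ignores; committees = 10 below is an arbitrary choice for illustration.
# Refit Cubist with committees and recompute importance
model6_committees <- cubist(
  x = simulated[, !names(simulated) %in% "y"],
  y = simulated$y,
  committees = 10
)
varImp(model6_committees, scale = FALSE) |>
  arrange(-Overall)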
8.2
Use a simulation to show tree bias with different granularities.
Trees suffer from selection bias: predictors with more distinct values (finer granularity) are favored over predictors with fewer distinct values. To show this, I simulate data that includes predictors of varying granularity, each contributing equally to the response.
# Generate simulation data
n <- 500000 # N = 500,000 observations in simulation
# Predictors with varying granularity
most_granular <- rnorm(n) # Most granular - continuous values
very_granular <- round(rnorm(n), 2) # Rounded to 2 decimals
medium_granular <- round(rnorm(n), 1) # Rounded to 1 decimal
somewhat_granular <- round(rnorm(n)) # Integers only
least_granular <- cut(round(rnorm(n)), breaks = 2) # Least granular - 2 categories
# Response variable (equal weight for all predictors)
y <- most_granular +
very_granular +
somewhat_granular +
medium_granular +
as.numeric(least_granular) +
rnorm(n, 0, 0.1)
# Combine into data frame
sim_data <- data.frame(
y = y,
most_granular = most_granular,
very_granular = very_granular,
somewhat_granular = somewhat_granular,
medium_granular = medium_granular,
least_granular = least_granular
)
# Fit single decision tree
tree_model <- rpart(y ~ ., data = sim_data)
# Get variable importance
imp <- varImp(tree_model, scale = FALSE) |>
arrange(-Overall)
imp
                   Overall
very_granular 2.917810
most_granular 2.239911
medium_granular 1.861564
somewhat_granular 1.613795
least_granular 1.077911
The two most granular predictors top the importance list, and the scores decline as granularity decreases, with the categorical (least granular) predictor ranked last, even though every predictor contributes to the response with equal weight. This demonstrates the tree's selection bias toward more granular predictors.
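The bias shows up even more starkly when the response carries no signal at all; the supplementary simulation below (not part of the original assignment) uses a pure-noise response, so any splits the tree makes are driven only by how many candidate split points each predictor offers.
# Pure-noise response: splits are forced with cp = 0, so importance reflects
# only the number of available split points per predictor
sim_noise <- sim_data
sim_noise$y <- rnorm(n) # response independent of every predictor
noise_tree <- rpart(
  y ~ .,
  data = sim_noise,
  control = rpart.control(cp = 0, maxdepth = 2)
)
varImp(noise_tree, scale = FALSE) |>
  arrange(-Overall)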
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
- Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a high bagging fraction (0.9) and a high learning rate (0.9): each tree is built on nearly all of the training data and its contribution is added almost in full, so the strongest predictors dominate the fit within the first few iterations and keep accumulating importance. The model on the left (both parameters at 0.1) learns slowly on small random subsets of the data, so many more predictors get a chance to enter the ensemble, spreading importance across them.
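Fig. 8.24 in the text was produced by the authors; the following is only a rough sketch of how the two extremes could be reproduced on the solubility data from AppliedPredictiveModeling, with the number of trees and interaction depth held at arbitrary values.
data(solubility) # solTrainXtrans / solTrainY from AppliedPredictiveModeling
# Left-hand panel of Fig. 8.24: conservative settings (both parameters 0.1)
gbmLeft <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.1, shrinkage = 0.1,
  n.trees = 100, interaction.depth = 1,
  verbose = FALSE
)
# Right-hand panel: aggressive settings (both parameters 0.9)
gbmRight <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.9, shrinkage = 0.9,
  n.trees = 100, interaction.depth = 1,
  verbose = FALSE
)
# Relative influence: spread across many predictors on the left, concentrated
# in the first few on the right
head(summary(gbmLeft, plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)
Resampling both settings (for example with caret::train over these two grids) would also speak to the next question about which model generalizes better.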
- Which model do you think would be more predictive of other samples?
The model on the left would likely be more predictive of other samples because it spreads importance across more predictors, which suggests that it is capturing a broader range of information from the data. This can lead to better generalization to new samples compared to the model on the right, which may overfit to the few predictors it focuses on.
- How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree split on more predictors, so importance would be spread across a larger set of variables. This would flatten (decrease) the slope of the predictor importance profile for both models.
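Continuing the sketch above (same data and assumptions), refitting the aggressive model with a larger interaction depth should show the relative influence spreading over more predictors.
# Same aggressive settings as gbmRight, but with deeper trees
gbmRightDeep <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.9, shrinkage = 0.9,
  n.trees = 100, interaction.depth = 7,
  verbose = FALSE
)
head(summary(gbmRightDeep, plotit = FALSE), 10)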
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Below I impute missing values with KNN (the knnImpute pre-process also centers and scales every variable), remove near-zero variance predictors, and split the data into training (80%) and test sets, as in Exercises 6.3 and 7.5.
data(ChemicalManufacturingProcess)
# impute
impute <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
impute
Created from 152 samples and 58 variables
Pre-processing:
- centered (58)
- ignored (0)
- 5 nearest neighbor imputation (58)
- scaled (58)
# predict
chemical_impute <- predict(
impute,
ChemicalManufacturingProcess
)
# remove nzv predictors
nzv <- nearZeroVar(chemical_impute)
filtered_chemical <- chemical_impute[, -nzv]
# Split the data into a training and a test set
trainingRows <- createDataPartition(
filtered_chemical$Yield,
p = .80,
list = FALSE
)
chemical_train <- filtered_chemical[trainingRows, ]
chemical_test <- filtered_chemical[-trainingRows, ]
Next I will train single tree, model tree, bagged tree, random forest, boosted tree, and Cubist tree-based models.
# Train
rpartTune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv")
)
rpartTune
CART
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 131, 129, 128, 129, 129, ...
Resampling results across tuning parameters:
maxdepth RMSE Rsquared MAE
1 0.7939879 0.3790177 0.6193060
2 0.7952428 0.3743530 0.6362057
3 0.8080370 0.3649796 0.6536293
4 0.8007317 0.3954948 0.6354054
5 0.8263759 0.3627811 0.6450699
6 0.8194596 0.3799275 0.6414531
7 0.8232977 0.3779287 0.6440841
8 0.8209108 0.3899266 0.6477862
9 0.8287774 0.3857141 0.6496598
10 0.8230216 0.3958743 0.6398306
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 1.
# Predict
rpartPred <- predict(
rpartTune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
m5Tune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "M5",
trControl = trainControl(method = "cv"),
control = Weka_control(M = 10)
)
m5Tune
Model Tree
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 131, 129, 130, 129, 130, 130, ...
Resampling results across tuning parameters:
pruned smoothed rules RMSE Rsquared MAE
Yes Yes Yes 0.6644582 0.5583891 0.5513304
Yes Yes No 0.6603697 0.5624499 0.5494850
Yes No Yes 0.6670654 0.5622058 0.5551017
Yes No No 0.6855313 0.5355528 0.5703550
No Yes Yes 0.7560906 0.4667567 0.6079317
No Yes No 0.6786198 0.5448752 0.5340998
No No Yes 0.8943379 0.3513275 0.7079283
No No No 0.8382678 0.4173680 0.6224411
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were pruned = Yes, smoothed = Yes and
rules = No.
# Predict
m5Pred <- predict(
m5Tune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
rfModel <- randomForest(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
importance = TRUE,
ntrees = 1000
)
rfModel
Call:
randomForest(x = chemical_train[, !names(chemical_train) %in% "Yield"], y = chemical_train$Yield, importance = TRUE, ntrees = 1000)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 18
Mean of squared residuals: 0.3995747
% Var explained: 59.56
# Predict
rfPred <- predict(
rfModel,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
gbmGrid <- expand.grid(
interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10
)
gbmTune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE
)
gbmTune
Stochastic Gradient Boosting
144 samples
56 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
shrinkage interaction.depth n.trees RMSE Rsquared MAE
0.01 1 100 0.8192007 0.4387410 0.6588791
0.01 1 150 0.7819880 0.4592703 0.6265023
0.01 1 200 0.7591395 0.4689319 0.6063344
0.01 1 250 0.7435927 0.4759151 0.5921697
0.01 1 300 0.7321354 0.4824040 0.5816901
0.01 1 350 0.7241348 0.4863530 0.5745643
0.01 1 400 0.7186706 0.4891357 0.5694321
0.01 1 450 0.7153057 0.4909222 0.5662060
0.01 1 500 0.7117756 0.4936824 0.5624124
0.01 1 550 0.7093585 0.4954635 0.5600009
0.01 1 600 0.7073823 0.4970491 0.5580834
0.01 1 650 0.7054321 0.4987813 0.5560847
0.01 1 700 0.7038043 0.5002028 0.5549556
0.01 1 750 0.7022719 0.5014103 0.5541104
0.01 1 800 0.7011923 0.5024915 0.5533679
0.01 1 850 0.7004643 0.5031950 0.5528763
0.01 1 900 0.6999880 0.5035690 0.5528412
0.01 1 950 0.6991241 0.5043870 0.5521222
0.01 1 1000 0.6982518 0.5052653 0.5515482
0.01 3 100 0.7751655 0.4812902 0.6209172
0.01 3 150 0.7373407 0.4974227 0.5868189
0.01 3 200 0.7151222 0.5087493 0.5670061
0.01 3 250 0.7015054 0.5166029 0.5548201
0.01 3 300 0.6933257 0.5213276 0.5478930
0.01 3 350 0.6877716 0.5255079 0.5433292
0.01 3 400 0.6828004 0.5296532 0.5393118
0.01 3 450 0.6792701 0.5329816 0.5365429
0.01 3 500 0.6766992 0.5354865 0.5346661
0.01 3 550 0.6748021 0.5375307 0.5335280
0.01 3 600 0.6734755 0.5390185 0.5326339
0.01 3 650 0.6720125 0.5406904 0.5317651
0.01 3 700 0.6709957 0.5418618 0.5310225
0.01 3 750 0.6700056 0.5428083 0.5301231
0.01 3 800 0.6691661 0.5439259 0.5293763
0.01 3 850 0.6681675 0.5450443 0.5287834
0.01 3 900 0.6672259 0.5462924 0.5280622
0.01 3 950 0.6668358 0.5468759 0.5278012
0.01 3 1000 0.6663054 0.5476314 0.5275184
0.01 5 100 0.7698073 0.4841970 0.6150885
0.01 5 150 0.7321726 0.4994533 0.5796588
0.01 5 200 0.7097367 0.5139510 0.5600242
0.01 5 250 0.6974024 0.5210555 0.5499256
0.01 5 300 0.6902545 0.5257209 0.5439786
0.01 5 350 0.6849169 0.5299385 0.5394740
0.01 5 400 0.6807538 0.5333697 0.5361648
0.01 5 450 0.6772446 0.5368938 0.5336563
0.01 5 500 0.6749934 0.5390685 0.5321270
0.01 5 550 0.6731228 0.5411697 0.5309413
0.01 5 600 0.6712496 0.5434230 0.5299119
0.01 5 650 0.6693628 0.5456832 0.5285164
0.01 5 700 0.6675674 0.5478610 0.5272880
0.01 5 750 0.6664925 0.5491790 0.5266144
0.01 5 800 0.6656031 0.5504335 0.5261653
0.01 5 850 0.6645580 0.5517645 0.5253529
0.01 5 900 0.6636362 0.5529843 0.5247007
0.01 5 950 0.6629325 0.5538954 0.5241265
0.01 5 1000 0.6623751 0.5545846 0.5237786
0.01 7 100 0.7695176 0.4862516 0.6154580
0.01 7 150 0.7307072 0.5037883 0.5799249
0.01 7 200 0.7102870 0.5127045 0.5608227
0.01 7 250 0.6976708 0.5201822 0.5498434
0.01 7 300 0.6895817 0.5257588 0.5424327
0.01 7 350 0.6846348 0.5292546 0.5381727
0.01 7 400 0.6806331 0.5329040 0.5353010
0.01 7 450 0.6773108 0.5362745 0.5332532
0.01 7 500 0.6747121 0.5389698 0.5313273
0.01 7 550 0.6725161 0.5413620 0.5297427
0.01 7 600 0.6708064 0.5433768 0.5286419
0.01 7 650 0.6693191 0.5452413 0.5276077
0.01 7 700 0.6683426 0.5462748 0.5272051
0.01 7 750 0.6669679 0.5478872 0.5263739
0.01 7 800 0.6658526 0.5494356 0.5257545
0.01 7 850 0.6648872 0.5506465 0.5249069
0.01 7 900 0.6641470 0.5518161 0.5244138
0.01 7 950 0.6634628 0.5527849 0.5240500
0.01 7 1000 0.6629003 0.5534641 0.5237444
0.10 1 100 0.7028540 0.5001290 0.5554063
0.10 1 150 0.7002613 0.5053635 0.5548254
0.10 1 200 0.7047638 0.5024541 0.5597897
0.10 1 250 0.7056576 0.5025781 0.5604347
0.10 1 300 0.7063996 0.5031552 0.5612299
0.10 1 350 0.7095600 0.5004912 0.5627701
0.10 1 400 0.7102189 0.5008242 0.5646824
0.10 1 450 0.7102818 0.5013988 0.5638107
0.10 1 500 0.7125136 0.4989532 0.5660312
0.10 1 550 0.7140632 0.4975104 0.5666163
0.10 1 600 0.7132155 0.4990397 0.5655741
0.10 1 650 0.7142424 0.4987236 0.5659557
0.10 1 700 0.7140935 0.4992467 0.5656856
0.10 1 750 0.7146885 0.4986257 0.5664475
0.10 1 800 0.7155274 0.4982165 0.5668084
0.10 1 850 0.7159218 0.4979866 0.5673861
0.10 1 900 0.7169646 0.4973104 0.5682348
0.10 1 950 0.7175031 0.4968325 0.5687843
0.10 1 1000 0.7178399 0.4968816 0.5689116
0.10 3 100 0.6779103 0.5330921 0.5350331
0.10 3 150 0.6751728 0.5368774 0.5329682
0.10 3 200 0.6719459 0.5421543 0.5304624
0.10 3 250 0.6703474 0.5445313 0.5298977
0.10 3 300 0.6691899 0.5462120 0.5293082
0.10 3 350 0.6685266 0.5471119 0.5288868
0.10 3 400 0.6681860 0.5476769 0.5286364
0.10 3 450 0.6678494 0.5482909 0.5283516
0.10 3 500 0.6675173 0.5488567 0.5280958
0.10 3 550 0.6676729 0.5487780 0.5281742
0.10 3 600 0.6672371 0.5493949 0.5277681
0.10 3 650 0.6671822 0.5495658 0.5277912
0.10 3 700 0.6669347 0.5498732 0.5275374
0.10 3 750 0.6669031 0.5500046 0.5275145
0.10 3 800 0.6667918 0.5501483 0.5274087
0.10 3 850 0.6668328 0.5501782 0.5274805
0.10 3 900 0.6667916 0.5502431 0.5274328
0.10 3 950 0.6667369 0.5503529 0.5273884
0.10 3 1000 0.6667202 0.5503915 0.5273515
0.10 5 100 0.6804250 0.5289773 0.5374433
0.10 5 150 0.6769271 0.5345524 0.5357542
0.10 5 200 0.6746945 0.5373662 0.5352004
0.10 5 250 0.6733971 0.5395160 0.5346887
0.10 5 300 0.6727300 0.5404338 0.5345127
0.10 5 350 0.6724945 0.5410784 0.5348770
0.10 5 400 0.6715958 0.5423929 0.5344657
0.10 5 450 0.6715655 0.5426478 0.5347288
0.10 5 500 0.6713949 0.5430180 0.5347509
0.10 5 550 0.6712143 0.5433243 0.5346925
0.10 5 600 0.6711313 0.5435056 0.5346866
0.10 5 650 0.6711397 0.5435877 0.5347684
0.10 5 700 0.6710655 0.5437447 0.5346854
0.10 5 750 0.6710008 0.5438586 0.5346406
0.10 5 800 0.6709471 0.5439749 0.5346592
0.10 5 850 0.6709528 0.5440375 0.5346823
0.10 5 900 0.6709262 0.5440431 0.5346695
0.10 5 950 0.6708974 0.5440832 0.5346351
0.10 5 1000 0.6709067 0.5440988 0.5346520
0.10 7 100 0.6791785 0.5295221 0.5391169
0.10 7 150 0.6749041 0.5363048 0.5360778
0.10 7 200 0.6731576 0.5391695 0.5350272
0.10 7 250 0.6717598 0.5413434 0.5337615
0.10 7 300 0.6706521 0.5429353 0.5328758
0.10 7 350 0.6701359 0.5438818 0.5324156
0.10 7 400 0.6694551 0.5447607 0.5317706
0.10 7 450 0.6692304 0.5451621 0.5315433
0.10 7 500 0.6688826 0.5457676 0.5313102
0.10 7 550 0.6685351 0.5463904 0.5309713
0.10 7 600 0.6685301 0.5464992 0.5309858
0.10 7 650 0.6682944 0.5468495 0.5307741
0.10 7 700 0.6682105 0.5469693 0.5306682
0.10 7 750 0.6681736 0.5470991 0.5306506
0.10 7 800 0.6681006 0.5471774 0.5306031
0.10 7 850 0.6680863 0.5472251 0.5306258
0.10 7 900 0.6681013 0.5472295 0.5306316
0.10 7 950 0.6680747 0.5472807 0.5306286
0.10 7 1000 0.6680960 0.5472618 0.5306379
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 1000, interaction.depth =
5, shrinkage = 0.01 and n.minobsinnode = 10.
# Predict
gbmPred <- predict(
gbmTune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
cubistMod <- cubist(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield
)
cubistMod
Call:
cubist.default(x = chemical_train[, !names(chemical_train) %in% "Yield"], y
= chemical_train$Yield)
Number of samples: 144
Number of predictors: 56
Number of committees: 1
Number of rules: 2
# Predict
cubistPred <- predict(
cubistMod,
chemical_test[, !names(chemical_test) %in% "Yield"]
)
- Which tree-based regression model gives the optimal resampling and test set performance?
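The comparison below references baggedPred from a bagged tree fit whose training chunk does not appear above; the following is a minimal sketch of how that fit could look, assuming caret's treebag method (the original call may have differed).
# Bagged tree fit (sketch only; the original chunk is not shown above)
baggedTune <- train(
  chemical_train[, !names(chemical_train) %in% "Yield"],
  chemical_train$Yield,
  method = "treebag",
  trControl = trainControl(method = "cv")
)
baggedPred <- predict(
  baggedTune,
  newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)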
ranking <- data.frame(
Model = c(
"Single Tree",
"Model Tree",
"Bagged Tree",
"Random Forest",
"Boosted Tree",
"Cubist Tree"
),
rbind(
postResample(pred = rpartPred, obs = chemical_test$Yield),
postResample(pred = m5Pred, obs = chemical_test$Yield),
postResample(pred = baggedPred, obs = chemical_test$Yield),
postResample(pred = rfPred, obs = chemical_test$Yield),
postResample(pred = gbmPred, obs = chemical_test$Yield),
postResample(pred = cubistPred, obs = chemical_test$Yield)
)
) |>
arrange(RMSE)
ranking
          Model      RMSE  Rsquared       MAE
1 Model Tree 0.5392361 0.7898164 0.4546390
2 Cubist Tree 0.5896184 0.6606362 0.4770851
3 Boosted Tree 0.5946542 0.6738592 0.4089447
4 Random Forest 0.6281927 0.6711592 0.4590116
5 Bagged Tree 0.6564029 0.6911976 0.5062169
6 Single Tree 0.7609376 0.4370784 0.6606858
best <- ranking[1, 1]
The model tree (M5) gives the optimal resampling and test set performance: it had the lowest resampled RMSE among the tuned models and has the lowest test set RMSE and highest test set R squared (the boosted tree has a slightly lower test MAE).
- Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
importance <- varImp(m5Tune, scale = FALSE)$importance |>
arrange(-Overall)
most_important <- importance |>
head(10) |>
rownames()
most_important
 [1] "ManufacturingProcess32" "ManufacturingProcess13" "BiologicalMaterial06"
[4] "ManufacturingProcess09" "ManufacturingProcess36" "BiologicalMaterial03"
[7] "ManufacturingProcess17" "BiologicalMaterial02" "ManufacturingProcess31"
[10] "BiologicalMaterial12"
These are the ten most important predictors in the optimal tree-based model (the model tree): ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess09, ManufacturingProcess36, BiologicalMaterial03, ManufacturingProcess17, BiologicalMaterial02, ManufacturingProcess31, and BiologicalMaterial12. Manufacturing process variables dominate the list, accounting for six of the top ten.
The top 10 important predictors from the optimal nonlinear model from Exercise 7.5, the SVM regression model (published on RPubs), were identical to the important tree-based predictors above, although ranked in a different order. The top 10 predictors from the optimal linear model from Exercise 6.3 (also published on RPubs) were likewise all considered important by the tree-based model.
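That comparison was made against the earlier write-ups by inspection; below is a sketch of how it could be done programmatically, assuming the tuned objects from Exercises 7.5 and 6.3 were still in the workspace under the hypothetical names svmTune and linearTune.
# svmTune and linearTune are hypothetical names for models carried over from
# Exercises 7.5 and 6.3; they are not refit in this document
svm_top10 <- varImp(svmTune)$importance |>
  arrange(-Overall) |>
  head(10) |>
  rownames()
linear_top10 <- varImp(linearTune)$importance |>
  arrange(-Overall) |>
  head(10) |>
  rownames()
intersect(most_important, svm_top10) # overlap with the tree-based top 10
intersect(most_important, linear_top10)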
- Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree2 <- as.party(rpartTune$finalModel)
plot(rpartTree2)
This plot shows that the process predictors drive the split structure: ManufacturingProcess32 determines the split at the root node, and the terminal-node boxplots show higher yields when ManufacturingProcess32 is above its split threshold. Since the optimal single tree has maxdepth = 1, the added knowledge is modest but clear: a single manufacturing process variable separates higher-yield from lower-yield batches.
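As a supplementary check (not part of the original output), the yield distribution in each terminal node can also be summarized numerically; the sketch below assumes the final rpart fit's row order matches chemical_train, which is how caret refits the final model on the full training set.
# Summarize (centered and scaled) Yield within each terminal node; rpart
# stores each training row's terminal node assignment in $where
node <- rpartTune$finalModel$where
tapply(chemical_train$Yield, node, summary)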