library(tidyverse)
library(AppliedPredictiveModeling)
library(caret)
library(GGally)
library(mlbench)
library(Cubist)
library(gbm)
library(party)
library(partykit)
library(RWeka)
library(rpart)
library(randomForest)
library(janitor)
# Set seed
set.seed(200)
Homework 9
8.1
Recreate the simulated data from Exercise 7.2:
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
- Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE) |>
arrange(-Overall)
rfImp1
                Overall
V1 8.732235404
V4 7.615118809
V2 6.415369387
V5 2.023524577
V3 0.763591825
V6 0.165111172
V7 -0.005961659
V10 -0.074944788
V9 -0.095292651
V8 -0.166362581
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. V6-V10 sit at the bottom of the importance list with scores near zero (all below 0.2, and several negative), so the random forest made essentially no use of the uninformative predictors.
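As a supplementary check (not produced in the original assignment), randomForest's varUsed() counts how many times each predictor is chosen for a split across the forest; the uninformative predictors should be selected far less often than V1-V5.
# Supplementary check: split counts per predictor across the forest
splitCounts <- varUsed(model1, count = TRUE)
names(splitCounts) <- rownames(model1$importance)
sort(splitCounts, decreasing = TRUE)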
- Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE) |> arrange(-Overall)
rfImp2
               Overall
V4 7.04752238
V2 6.06896061
V1 5.69119973
duplicate1 4.28331581
V5 1.87238438
V3 0.62970218
V6 0.13569065
V10 0.02894814
V9 0.00840438
V7 -0.01345645
V8 -0.04370565
# Add another predictor that is also highly correlated with V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE) |> arrange(-Overall)
rfImp3
               Overall
V4 7.04870917
V2 6.52816504
V1 4.91687329
duplicate1 3.80068234
V5 2.03115561
duplicate2 1.87721959
V3 0.58711552
V6 0.14213148
V7 0.10991985
V10 0.09230576
V9 -0.01075028
V8 -0.08405687
# Pull importance for V1 under each model.
rfImp1_V1 <- rfImp1["V1", "Overall"]
rfImp2_V1 <- rfImp2["V1", "Overall"]
rfImp3_V1 <- rfImp3["V1", "Overall"]
Yes, the importance score for V1 changes after each highly correlated predictor is added: it drops from about 8.73 in the original model to about 5.69 with one duplicate and about 4.92 with two, because the importance is split among the correlated copies.
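A supplementary check (values not printed above) suggests the drop is a redistribution rather than a loss of signal: the combined importance of V1 and its correlated copies stays roughly comparable to V1's importance in the original model.
# Supplementary check: importance of V1 alone vs. V1 plus its duplicates
rfImp1["V1", "Overall"]
sum(rfImp2[c("V1", "duplicate1"), "Overall"])
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])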
- Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated)
rfImp4 <- varimp(model4, conditional = TRUE) |>
  as.data.frame() |>
  clean_names() |>
  arrange(desc(varimp_model4_conditional_true))
rfImp4
           varimp_model4_conditional_true
V4 5.080528995
V2 4.948395426
V1 2.469782364
duplicate1 1.996602485
V5 1.327443662
duplicate2 0.684406699
V7 -0.002487935
V3 -0.016746065
V6 -0.163556406
V10 -0.164152112
V9 -0.223192079
V8 -0.302631078
Yes. The conditional importances show the same general pattern as the traditional random forest: V4 and V2 lead, V1 shares credit with its duplicates, and V6-V10 sit near zero at the bottom of the list.
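For comparison (not shown in the original output), the traditional, unconditional importance can be pulled from the same cforest fit; conditional = TRUE is the Strobl et al. (2007) adjustment that down-weights predictors correlated with others, such as V1 and its duplicates.
# Unconditional permutation importance from the same cforest fit
varimp(model4, conditional = FALSE) |>
  sort(decreasing = TRUE)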
- Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Fit boosted trees
model5 <- gbm(y ~ ., data = simulated, distribution = "gaussian")
gbmImp <- summary(model5) # Plot importance scores
gbmImp # Print importance scores
                      var    rel.inf
V4 V4 29.9457890
V2 V2 22.4410503
duplicate1 duplicate1 16.2795388
V5 V5 11.7083538
V3 V3 8.6372261
V1 V1 6.0343438
duplicate2 duplicate2 4.5584868
V6 V6 0.3952114
V7 V7 0.0000000
V8 V8 0.0000000
V9 V9 0.0000000
V10 V10 0.0000000
# Fit Cubist
model6 <- cubist(
  x = simulated[, !names(simulated) %in% "y"],
  y = simulated$y
)
cubImp <- varImp(model6, scale = FALSE) |>
  arrange(-Overall)
cubImp
           Overall
V1 50
V2 50
V4 50
V5 50
duplicate1 50
V3 0
V6 0
V7 0
V8 0
V9 0
V10 0
duplicate2 0
The boosted tree and Cubist models show the same general pattern, with V6-V10 among the least important predictors. The correlated copies of V1 again dilute its ranking: the boosted tree shifts much of V1's influence onto duplicate1, while Cubist keeps V1 and duplicate1 tied at the top of its usage-based scores but assigns duplicate2 no importance at all.
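As a quick follow-up (not part of the original output), it may be worth checking whether the Cubist pattern holds when committees are used, since committee models can pull in predictors that a single rule set ignores; committees = 10 below is an arbitrary choice for illustration.
# Refit Cubist with committees and recompute importance
model6_committees <- cubist(
  x = simulated[, !names(simulated) %in% "y"],
  y = simulated$y,
  committees = 10
)
varImp(model6_committees, scale = FALSE) |>
  arrange(-Overall)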
8.2
Use a simulation to show tree bias with different granularities.
Trees suffer from selection bias: predictors with more distinct values (finer granularity) are favored over predictors with fewer distinct values. To show this, I simulate data that includes predictors of varying granularity, each contributing equally to the response.
# Generate simulation data
n <- 500000 # N = 500,000 observations in simulation
# Predictors with varying granularity
most_granular <- rnorm(n) # Most granular - continuous values
very_granular <- round(rnorm(n), 2) # Rounded to 2 decimals
medium_granular <- round(rnorm(n), 1) # Rounded to 1 decimal
somewhat_granular <- round(rnorm(n)) # Integers only
least_granular <- cut(round(rnorm(n)), breaks = 2) # Least granular - 2 categories
# Response variable (equal weight for all predictors)
y <- most_granular +
very_granular +
somewhat_granular +
medium_granular +
as.numeric(least_granular) +
rnorm(n, 0, 0.1)
# Combine into data frame
sim_data <- data.frame(
y = y,
most_granular = most_granular,
very_granular = very_granular,
somewhat_granular = somewhat_granular,
medium_granular = medium_granular,
least_granular = least_granular
)
# Fit single decision tree
tree_model <- rpart(y ~ ., data = sim_data)
# Get variable importance
imp <- varImp(tree_model, scale = FALSE) |>
arrange(-Overall)
imp
                   Overall
very_granular 2.917810
most_granular 2.239911
medium_granular 1.861564
somewhat_granular 1.613795
least_granular 1.077911
The two most granular predictors top the importance list, and the scores decline as granularity decreases, with the categorical (least granular) predictor ranked last, even though every predictor contributes to the response with equal weight. This demonstrates the tree's selection bias toward more granular predictors.
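The bias shows up even more starkly when the response carries no signal at all; the supplementary simulation below (not part of the original assignment) uses a pure-noise response, so any splits the tree makes are driven only by how many candidate split points each predictor offers.
# Pure-noise response: splits are forced with cp = 0, so importance reflects
# only the number of available split points per predictor
sim_noise <- sim_data
sim_noise$y <- rnorm(n) # response independent of every predictor
noise_tree <- rpart(
  y ~ .,
  data = sim_noise,
  control = rpart.control(cp = 0, maxdepth = 2)
)
varImp(noise_tree, scale = FALSE) |>
  arrange(-Overall)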
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
- Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a high bagging fraction (0.9) and a high learning rate (0.9): each tree is built on nearly all of the training data and its contribution is added almost in full, so the strongest predictors dominate the fit within the first few iterations and keep accumulating importance. The model on the left (both parameters at 0.1) learns slowly on small random subsets of the data, so many more predictors get a chance to enter the ensemble, spreading importance across them.
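Fig. 8.24 in the text was produced by the authors; the following is only a rough sketch of how the two extremes could be reproduced on the solubility data from AppliedPredictiveModeling, with the number of trees and interaction depth held at arbitrary values.
data(solubility) # solTrainXtrans / solTrainY from AppliedPredictiveModeling
# Left-hand panel of Fig. 8.24: conservative settings (both parameters 0.1)
gbmLeft <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.1, shrinkage = 0.1,
  n.trees = 100, interaction.depth = 1,
  verbose = FALSE
)
# Right-hand panel: aggressive settings (both parameters 0.9)
gbmRight <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.9, shrinkage = 0.9,
  n.trees = 100, interaction.depth = 1,
  verbose = FALSE
)
# Relative influence: spread across many predictors on the left, concentrated
# in the first few on the right
head(summary(gbmLeft, plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)
Resampling both settings (for example with caret::train over these two grids) would also speak to the next question about which model generalizes better.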
- Which model do you think would be more predictive of other samples?
The model on the left would likely be more predictive of other samples because it spreads importance across more predictors, which suggests that it is capturing a broader range of information from the data. This can lead to better generalization to new samples compared to the model on the right, which may overfit to the few predictors it focuses on.
- How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree split on more predictors, so importance would be spread across a larger set of variables. This would flatten (decrease) the slope of the predictor importance profile for both models.
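Continuing the sketch above (same data and assumptions), refitting the aggressive model with a larger interaction depth should show the relative influence spreading over more predictors.
# Same aggressive settings as gbmRight, but with deeper trees
gbmRightDeep <- gbm.fit(
  solTrainXtrans, solTrainY,
  distribution = "gaussian",
  bag.fraction = 0.9, shrinkage = 0.9,
  n.trees = 100, interaction.depth = 7,
  verbose = FALSE
)
head(summary(gbmRightDeep, plotit = FALSE), 10)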
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Below I impute missing values with KNN (the knnImpute pre-process also centers and scales every variable), remove near-zero variance predictors, and split the data into training (80%) and test sets, as in Exercises 6.3 and 7.5.
data(ChemicalManufacturingProcess)
# impute
impute <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
impute
Created from 152 samples and 58 variables
Pre-processing:
- centered (58)
- ignored (0)
- 5 nearest neighbor imputation (58)
- scaled (58)
# predict
chemical_impute <- predict(
impute,
ChemicalManufacturingProcess
)
# remove nzv predictors
nzv <- nearZeroVar(chemical_impute)
filtered_chemical <- chemical_impute[, -nzv]
# Split the data into a training and a test set
trainingRows <- createDataPartition(
filtered_chemical$Yield,
p = .80,
list = FALSE
)
chemical_train <- filtered_chemical[trainingRows, ]
chemical_test <- filtered_chemical[-trainingRows, ]
Next I will train single tree, model tree, bagged tree, random forest, boosted tree, and Cubist tree-based models.
# Train
rpartTune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv")
)
rpartTune
CART
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 130, 131, 129, 128, 129, 129, ...
Resampling results across tuning parameters:
maxdepth RMSE Rsquared MAE
1 0.7939879 0.3790177 0.6193060
2 0.7952428 0.3743530 0.6362057
3 0.8080370 0.3649796 0.6536293
4 0.8007317 0.3954948 0.6354054
5 0.8263759 0.3627811 0.6450699
6 0.8194596 0.3799275 0.6414531
7 0.8232977 0.3779287 0.6440841
8 0.8209108 0.3899266 0.6477862
9 0.8287774 0.3857141 0.6496598
10 0.8230216 0.3958743 0.6398306
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 1.
# Predict
rpartPred <- predict(
rpartTune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
m5Tune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "M5",
trControl = trainControl(method = "cv"),
control = Weka_control(M = 10)
)
m5Tune
Model Tree
144 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 131, 129, 130, 129, 130, 130, ...
Resampling results across tuning parameters:
pruned smoothed rules RMSE Rsquared MAE
Yes Yes Yes 0.6644582 0.5583891 0.5513304
Yes Yes No 0.6603697 0.5624499 0.5494850
Yes No Yes 0.6670654 0.5622058 0.5551017
Yes No No 0.6855313 0.5355528 0.5703550
No Yes Yes 0.7560906 0.4667567 0.6079317
No Yes No 0.6786198 0.5448752 0.5340998
No No Yes 0.8943379 0.3513275 0.7079283
No No No 0.8382678 0.4173680 0.6224411
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were pruned = Yes, smoothed = Yes and
rules = No.
# Predict
m5Pred <- predict(
m5Tune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
rfModel <- randomForest(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
importance = TRUE,
ntrees = 1000
)
rfModel
Call:
randomForest(x = chemical_train[, !names(chemical_train) %in% "Yield"], y = chemical_train$Yield, importance = TRUE, ntrees = 1000)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 18
Mean of squared residuals: 0.3995747
% Var explained: 59.56
# Predict
rfPred <- predict(
rfModel,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
gbmGrid <- expand.grid(
interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10
)
gbmTune <- train(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE
)
gbmTune
Stochastic Gradient Boosting
144 samples
56 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
Resampling results across tuning parameters:
shrinkage interaction.depth n.trees RMSE Rsquared MAE
0.01 1 100 0.8192007 0.4387410 0.6588791
0.01 1 150 0.7819880 0.4592703 0.6265023
0.01 1 200 0.7591395 0.4689319 0.6063344
0.01 1 250 0.7435927 0.4759151 0.5921697
0.01 1 300 0.7321354 0.4824040 0.5816901
0.01 1 350 0.7241348 0.4863530 0.5745643
0.01 1 400 0.7186706 0.4891357 0.5694321
0.01 1 450 0.7153057 0.4909222 0.5662060
0.01 1 500 0.7117756 0.4936824 0.5624124
0.01 1 550 0.7093585 0.4954635 0.5600009
0.01 1 600 0.7073823 0.4970491 0.5580834
0.01 1 650 0.7054321 0.4987813 0.5560847
0.01 1 700 0.7038043 0.5002028 0.5549556
0.01 1 750 0.7022719 0.5014103 0.5541104
0.01 1 800 0.7011923 0.5024915 0.5533679
0.01 1 850 0.7004643 0.5031950 0.5528763
0.01 1 900 0.6999880 0.5035690 0.5528412
0.01 1 950 0.6991241 0.5043870 0.5521222
0.01 1 1000 0.6982518 0.5052653 0.5515482
0.01 3 100 0.7751655 0.4812902 0.6209172
0.01 3 150 0.7373407 0.4974227 0.5868189
0.01 3 200 0.7151222 0.5087493 0.5670061
0.01 3 250 0.7015054 0.5166029 0.5548201
0.01 3 300 0.6933257 0.5213276 0.5478930
0.01 3 350 0.6877716 0.5255079 0.5433292
0.01 3 400 0.6828004 0.5296532 0.5393118
0.01 3 450 0.6792701 0.5329816 0.5365429
0.01 3 500 0.6766992 0.5354865 0.5346661
0.01 3 550 0.6748021 0.5375307 0.5335280
0.01 3 600 0.6734755 0.5390185 0.5326339
0.01 3 650 0.6720125 0.5406904 0.5317651
0.01 3 700 0.6709957 0.5418618 0.5310225
0.01 3 750 0.6700056 0.5428083 0.5301231
0.01 3 800 0.6691661 0.5439259 0.5293763
0.01 3 850 0.6681675 0.5450443 0.5287834
0.01 3 900 0.6672259 0.5462924 0.5280622
0.01 3 950 0.6668358 0.5468759 0.5278012
0.01 3 1000 0.6663054 0.5476314 0.5275184
0.01 5 100 0.7698073 0.4841970 0.6150885
0.01 5 150 0.7321726 0.4994533 0.5796588
0.01 5 200 0.7097367 0.5139510 0.5600242
0.01 5 250 0.6974024 0.5210555 0.5499256
0.01 5 300 0.6902545 0.5257209 0.5439786
0.01 5 350 0.6849169 0.5299385 0.5394740
0.01 5 400 0.6807538 0.5333697 0.5361648
0.01 5 450 0.6772446 0.5368938 0.5336563
0.01 5 500 0.6749934 0.5390685 0.5321270
0.01 5 550 0.6731228 0.5411697 0.5309413
0.01 5 600 0.6712496 0.5434230 0.5299119
0.01 5 650 0.6693628 0.5456832 0.5285164
0.01 5 700 0.6675674 0.5478610 0.5272880
0.01 5 750 0.6664925 0.5491790 0.5266144
0.01 5 800 0.6656031 0.5504335 0.5261653
0.01 5 850 0.6645580 0.5517645 0.5253529
0.01 5 900 0.6636362 0.5529843 0.5247007
0.01 5 950 0.6629325 0.5538954 0.5241265
0.01 5 1000 0.6623751 0.5545846 0.5237786
0.01 7 100 0.7695176 0.4862516 0.6154580
0.01 7 150 0.7307072 0.5037883 0.5799249
0.01 7 200 0.7102870 0.5127045 0.5608227
0.01 7 250 0.6976708 0.5201822 0.5498434
0.01 7 300 0.6895817 0.5257588 0.5424327
0.01 7 350 0.6846348 0.5292546 0.5381727
0.01 7 400 0.6806331 0.5329040 0.5353010
0.01 7 450 0.6773108 0.5362745 0.5332532
0.01 7 500 0.6747121 0.5389698 0.5313273
0.01 7 550 0.6725161 0.5413620 0.5297427
0.01 7 600 0.6708064 0.5433768 0.5286419
0.01 7 650 0.6693191 0.5452413 0.5276077
0.01 7 700 0.6683426 0.5462748 0.5272051
0.01 7 750 0.6669679 0.5478872 0.5263739
0.01 7 800 0.6658526 0.5494356 0.5257545
0.01 7 850 0.6648872 0.5506465 0.5249069
0.01 7 900 0.6641470 0.5518161 0.5244138
0.01 7 950 0.6634628 0.5527849 0.5240500
0.01 7 1000 0.6629003 0.5534641 0.5237444
0.10 1 100 0.7028540 0.5001290 0.5554063
0.10 1 150 0.7002613 0.5053635 0.5548254
0.10 1 200 0.7047638 0.5024541 0.5597897
0.10 1 250 0.7056576 0.5025781 0.5604347
0.10 1 300 0.7063996 0.5031552 0.5612299
0.10 1 350 0.7095600 0.5004912 0.5627701
0.10 1 400 0.7102189 0.5008242 0.5646824
0.10 1 450 0.7102818 0.5013988 0.5638107
0.10 1 500 0.7125136 0.4989532 0.5660312
0.10 1 550 0.7140632 0.4975104 0.5666163
0.10 1 600 0.7132155 0.4990397 0.5655741
0.10 1 650 0.7142424 0.4987236 0.5659557
0.10 1 700 0.7140935 0.4992467 0.5656856
0.10 1 750 0.7146885 0.4986257 0.5664475
0.10 1 800 0.7155274 0.4982165 0.5668084
0.10 1 850 0.7159218 0.4979866 0.5673861
0.10 1 900 0.7169646 0.4973104 0.5682348
0.10 1 950 0.7175031 0.4968325 0.5687843
0.10 1 1000 0.7178399 0.4968816 0.5689116
0.10 3 100 0.6779103 0.5330921 0.5350331
0.10 3 150 0.6751728 0.5368774 0.5329682
0.10 3 200 0.6719459 0.5421543 0.5304624
0.10 3 250 0.6703474 0.5445313 0.5298977
0.10 3 300 0.6691899 0.5462120 0.5293082
0.10 3 350 0.6685266 0.5471119 0.5288868
0.10 3 400 0.6681860 0.5476769 0.5286364
0.10 3 450 0.6678494 0.5482909 0.5283516
0.10 3 500 0.6675173 0.5488567 0.5280958
0.10 3 550 0.6676729 0.5487780 0.5281742
0.10 3 600 0.6672371 0.5493949 0.5277681
0.10 3 650 0.6671822 0.5495658 0.5277912
0.10 3 700 0.6669347 0.5498732 0.5275374
0.10 3 750 0.6669031 0.5500046 0.5275145
0.10 3 800 0.6667918 0.5501483 0.5274087
0.10 3 850 0.6668328 0.5501782 0.5274805
0.10 3 900 0.6667916 0.5502431 0.5274328
0.10 3 950 0.6667369 0.5503529 0.5273884
0.10 3 1000 0.6667202 0.5503915 0.5273515
0.10 5 100 0.6804250 0.5289773 0.5374433
0.10 5 150 0.6769271 0.5345524 0.5357542
0.10 5 200 0.6746945 0.5373662 0.5352004
0.10 5 250 0.6733971 0.5395160 0.5346887
0.10 5 300 0.6727300 0.5404338 0.5345127
0.10 5 350 0.6724945 0.5410784 0.5348770
0.10 5 400 0.6715958 0.5423929 0.5344657
0.10 5 450 0.6715655 0.5426478 0.5347288
0.10 5 500 0.6713949 0.5430180 0.5347509
0.10 5 550 0.6712143 0.5433243 0.5346925
0.10 5 600 0.6711313 0.5435056 0.5346866
0.10 5 650 0.6711397 0.5435877 0.5347684
0.10 5 700 0.6710655 0.5437447 0.5346854
0.10 5 750 0.6710008 0.5438586 0.5346406
0.10 5 800 0.6709471 0.5439749 0.5346592
0.10 5 850 0.6709528 0.5440375 0.5346823
0.10 5 900 0.6709262 0.5440431 0.5346695
0.10 5 950 0.6708974 0.5440832 0.5346351
0.10 5 1000 0.6709067 0.5440988 0.5346520
0.10 7 100 0.6791785 0.5295221 0.5391169
0.10 7 150 0.6749041 0.5363048 0.5360778
0.10 7 200 0.6731576 0.5391695 0.5350272
0.10 7 250 0.6717598 0.5413434 0.5337615
0.10 7 300 0.6706521 0.5429353 0.5328758
0.10 7 350 0.6701359 0.5438818 0.5324156
0.10 7 400 0.6694551 0.5447607 0.5317706
0.10 7 450 0.6692304 0.5451621 0.5315433
0.10 7 500 0.6688826 0.5457676 0.5313102
0.10 7 550 0.6685351 0.5463904 0.5309713
0.10 7 600 0.6685301 0.5464992 0.5309858
0.10 7 650 0.6682944 0.5468495 0.5307741
0.10 7 700 0.6682105 0.5469693 0.5306682
0.10 7 750 0.6681736 0.5470991 0.5306506
0.10 7 800 0.6681006 0.5471774 0.5306031
0.10 7 850 0.6680863 0.5472251 0.5306258
0.10 7 900 0.6681013 0.5472295 0.5306316
0.10 7 950 0.6680747 0.5472807 0.5306286
0.10 7 1000 0.6680960 0.5472618 0.5306379
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 1000, interaction.depth =
5, shrinkage = 0.01 and n.minobsinnode = 10.
# Predict
gbmPred <- predict(
gbmTune,
newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)
# Train
cubistMod <- cubist(
chemical_train[, !names(chemical_train) %in% "Yield"],
chemical_train$Yield
)
cubistMod
Call:
cubist.default(x = chemical_train[, !names(chemical_train) %in% "Yield"], y
= chemical_train$Yield)
Number of samples: 144
Number of predictors: 56
Number of committees: 1
Number of rules: 2
# Predict
cubistPred <- predict(
cubistMod,
chemical_test[, !names(chemical_test) %in% "Yield"]
)
- Which tree-based regression model gives the optimal resampling and test set performance?
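The comparison below references baggedPred from a bagged tree fit whose training chunk does not appear above; the following is a minimal sketch of how that fit could look, assuming caret's treebag method (the original call may have differed).
# Bagged tree fit (sketch only; the original chunk is not shown above)
baggedTune <- train(
  chemical_train[, !names(chemical_train) %in% "Yield"],
  chemical_train$Yield,
  method = "treebag",
  trControl = trainControl(method = "cv")
)
baggedPred <- predict(
  baggedTune,
  newdata = chemical_test[, !names(chemical_test) %in% "Yield"]
)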
ranking <- data.frame(
Model = c(
"Single Tree",
"Model Tree",
"Bagged Tree",
"Random Forest",
"Boosted Tree",
"Cubist Tree"
),
rbind(
postResample(pred = rpartPred, obs = chemical_test$Yield),
postResample(pred = m5Pred, obs = chemical_test$Yield),
postResample(pred = baggedPred, obs = chemical_test$Yield),
postResample(pred = rfPred, obs = chemical_test$Yield),
postResample(pred = gbmPred, obs = chemical_test$Yield),
postResample(pred = cubistPred, obs = chemical_test$Yield)
)
) |>
arrange(RMSE)
ranking
          Model      RMSE  Rsquared       MAE
1 Model Tree 0.5392361 0.7898164 0.4546390
2 Cubist Tree 0.5896184 0.6606362 0.4770851
3 Boosted Tree 0.5946542 0.6738592 0.4089447
4 Random Forest 0.6281927 0.6711592 0.4590116
5 Bagged Tree 0.6564029 0.6911976 0.5062169
6 Single Tree 0.7609376 0.4370784 0.6606858
best <- ranking[1, 1]
The model tree (M5) gives the optimal resampling and test set performance: it had the lowest resampled RMSE among the tuned models and has the lowest test set RMSE and highest test set R squared (the boosted tree has a slightly lower test MAE).
- Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
importance <- varImp(m5Tune, scale = FALSE)$importance |>
arrange(-Overall)
most_important <- importance |>
head(10) |>
rownames()
most_important
 [1] "ManufacturingProcess32" "ManufacturingProcess13" "BiologicalMaterial06"
[4] "ManufacturingProcess09" "ManufacturingProcess36" "BiologicalMaterial03"
[7] "ManufacturingProcess17" "BiologicalMaterial02" "ManufacturingProcess31"
[10] "BiologicalMaterial12"
These are the ten most important predictors in the optimal tree-based model (the model tree): ManufacturingProcess32, ManufacturingProcess13, BiologicalMaterial06, ManufacturingProcess09, ManufacturingProcess36, BiologicalMaterial03, ManufacturingProcess17, BiologicalMaterial02, ManufacturingProcess31, and BiologicalMaterial12. Manufacturing process variables dominate the list, accounting for six of the top ten.
The top 10 important predictors from the optimal nonlinear model from Exercise 7.5, the SVM regression model (published on RPubs), were identical to the important tree-based predictors above, although ranked in a different order. The top 10 predictors from the optimal linear model from Exercise 6.3 (also published on RPubs) were likewise all considered important by the tree-based model.
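That comparison was made against the earlier write-ups by inspection; below is a sketch of how it could be done programmatically, assuming the tuned objects from Exercises 7.5 and 6.3 were still in the workspace under the hypothetical names svmTune and linearTune.
# svmTune and linearTune are hypothetical names for models carried over from
# Exercises 7.5 and 6.3; they are not refit in this document
svm_top10 <- varImp(svmTune)$importance |>
  arrange(-Overall) |>
  head(10) |>
  rownames()
linear_top10 <- varImp(linearTune)$importance |>
  arrange(-Overall) |>
  head(10) |>
  rownames()
intersect(most_important, svm_top10) # overlap with the tree-based top 10
intersect(most_important, linear_top10)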
- Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
rpartTree2 <- as.party(rpartTune$finalModel)
plot(rpartTree2)
This plot shows that the process predictors drive the split structure: ManufacturingProcess32 determines the split at the root node, and the terminal-node boxplots show higher yields when ManufacturingProcess32 is above its split threshold. Since the optimal single tree has maxdepth = 1, the added knowledge is modest but clear: a single manufacturing process variable separates higher-yield from lower-yield batches.
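As a supplementary check (not part of the original output), the yield distribution in each terminal node can also be summarized numerically; the sketch below assumes the final rpart fit's row order matches chemical_train, which is how caret refits the final model on the full training set.
# Summarize (centered and scaled) Yield within each terminal node; rpart
# stores each training row's terminal node assignment in $where
node <- rpartTune$finalModel$where
tapply(chemical_train$Yield, node, summary)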