library(caret)
library(party)
library(dplyr)
library(pls)
library(pdp)
library(Cubist)
library(xgboost)
library(randomForest)
options(scipen=999)
library(earth)
library(Formula)
library(plotmo)
library(plotrix)
library(DiagrammeR)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Week 12 Trees and Rules
8.1 Simulated Data from Chapter 7
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit and Estimate
Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No, quite the opposite. V1, V2 and V4 are the important predictors. V6, V7, V8, V9, and V10 contribute very little when predicting.
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
Overall
V1 8.62743275
V2 6.27437240
V3 0.72305459
V4 7.50258584
V5 2.13575650
V6 0.12395003
V7 0.02927888
V8 -0.11724317
V9 -0.10344797
V10 0.04312556
Add an Additional Predictor
Now add an additional predictor that is highly correlated with one of the informative predictors. For example: Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Adding duplicate1 splits the importance between V1 and the new predictor: V1 drops from about 8.6 to 6.8 while duplicate1 picks up about 3.1. Beyond that, the rest of the variables stay pretty close to where they were before. I suppose that’s why they say random forest handles collinear data better than many other models do.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9485201
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
Overall
V1 6.774034589
V2 6.426340527
V3 0.613805379
V4 7.135941576
V5 2.135242904
V6 0.171933358
V7 0.142238552
V8 -0.073192083
V9 -0.098719872
V10 -0.009701234
duplicate1 3.084990840
Conditional Inference Trees
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Here’s what I found about this test.
- Timing it roughly (one Mississippi, two Mississippi), conditional=TRUE took about 25x longer than conditional=FALSE; a quick system.time() check is sketched after the code below.
- The two collinear variables, V1 and duplicate1, are much more heavily deprioritized.
- Aside from those collinear variables, the relative importance pattern looks broadly similar.
model3 <- cforest(y ~ ., data = simulated,
                  controls = cforest_unbiased(ntree = 1000, mtry = 2))

rfImp3False <- varimp(model3, conditional = FALSE)
rfImp3False_df <- data.frame(Overall = rfImp3False)
rownames(rfImp3False_df) <- names(rfImp3False)

rfImp3True <- varimp(model3, conditional = TRUE)
rfImp3True_df <- data.frame(Overall = rfImp3True)
rownames(rfImp3True_df) <- names(rfImp3True)

print(rfImp3False_df)
print(rfImp3True_df)
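To back up that rough timing claim, a quick sketch using system.time(); the 25x figure was informal and the exact ratio will vary with the machine and ntree.

# Rough timing comparison of the two importance computations
system.time(varimp(model3, conditional = FALSE))
system.time(varimp(model3, conditional = TRUE))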
Use Boosted and Cubist
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur? (refer to figure 8.24)
For both, yes, roughly the same pattern: V1, V2, V4, and V5 carry most of the importance, while the uninformative predictors (and duplicate1) contribute comparatively little.
predictors <- simulated[, !names(simulated) %in% "y"]  # this excludes 'y' by name
modelCubist <- cubist(x = predictors, y = simulated$y, committees = 1)

data_matrix <- xgb.DMatrix(
  data = as.matrix(simulated[, !names(simulated) %in% "y"]),
  label = simulated$y
)
modelXGBoost <- xgboost(data = data_matrix,
                        max.depth = 6, eta = 0.3, nthread = 2,
                        nrounds = 100, objective = "reg:squarederror", verbose = 0)
importance_matrix <- xgb.importance(
  feature_names = colnames(simulated[, !names(simulated) %in% "y"]),
  model = modelXGBoost)
summary(modelCubist)
Call:
cubist.default(x = predictors, y = simulated$y, committees = 1)
Cubist [Release 2.07 GPL Edition] Sat Apr 13 22:27:25 2024
---------------------------------
Target attribute `outcome'
Read 200 cases (12 attributes) from undefined.data
Model:
Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
Evaluation on training data (200 cases):
Average |error| 2.105826
Relative |error| 0.52
Correlation coefficient 0.86
Attribute usage:
Conds Model
100% V1
100% V2
100% V4
100% V5
Time: 0.0 secs
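For an importance table for the Cubist fit that lines up with the other models (rather than reading the attribute-usage block above), caret's varImp also has a method for cubist objects. A quick sketch; output not shown here.

# Usage-based variable importance for the Cubist model via caret
varImp(modelCubist)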
print(importance_matrix)
Feature Gain Cover Frequency
1: V1 0.325855544 0.09299500 0.18158568
2: V2 0.269424086 0.07885144 0.11764706
3: V4 0.264327403 0.10098633 0.09718670
4: V5 0.059889674 0.11897229 0.09514066
5: V3 0.032557346 0.14443508 0.11099744
6: V7 0.016198820 0.08972184 0.08235294
7: V6 0.009710453 0.07210807 0.06905371
8: duplicate1 0.007658205 0.08714928 0.05626598
9: V10 0.005966073 0.08680993 0.05984655
10: V8 0.004320903 0.05923436 0.07058824
11: V9 0.004091492 0.06873638 0.05933504
8.2 Tree Granularity Bias
Use a simulation to show tree bias with different granularity.
I had a difficult time coming up with this until I heard an example that made sense to me, and I’d like to keep it here in my notes so I can find it later.
Imagine you are trying to decide where to plant different types of plants in a garden based on sunlight and soil type:
- Soil type: You have a map that categorizes soil into different types based on its nutrients (granularity varies).
- Sunlight: Varies continuously throughout the garden.
If your map (granularity of X1) is very detailed, you might focus a lot on tiny differences in soil nutrients and miss broader patterns of sunlight (X2). Conversely, if the soil map is very basic, you might pay more attention to sunlight variations. This is analogous to how decision trees might develop biases based on the characteristics of the data they receive.
library(rpart)
set.seed(123) # for reproducibility
show_tree <- function(observations, granularity) {

  ## Var 1 - different levels of granularity. If granularity is 10, soil_type
  ## takes values 0, 0.1, 0.2, ..., 0.9. If granularity is 100, it takes values
  ## 0, 0.01, 0.02, ..., 0.99, making it much more detailed.
  soil_type <- round(runif(observations, 0, 1) * granularity) / granularity

  ## Var 2 - continuous, normally distributed
  sunlight <- rnorm(observations)

  ## Binary outcome
  Y <- soil_type + sunlight + rnorm(observations, sd = 0.5) > 1.5

  # Fit a classification tree on Y (TRUE/FALSE)
  data <- data.frame(soil_type, sunlight, Y)
  model <- rpart(Y ~ soil_type + sunlight, data = data, method = "class")

  # Plot the tree
  plot(model)
  text(model, use.n = TRUE)
}
show_tree(200, 10) # low granularity
show_tree(200, 100) # high granularity
In the first (low-granularity) tree there is only a single split, and soil_type is not used at all. The absence of splits on soil_type suggests a bias toward the continuous variable, sunlight: with such coarse categories, the tree did not find soil_type informative enough.
The second (high-granularity) tree splits on both soil_type and sunlight, indicating that the finer granularity made soil_type more informative. After the initial split there are further splits on soil_type (at soil_type < 0.675 and soil_type < 0.375), evidence that with more granularity the variable becomes informative enough to be used for making decisions. The tree now leans on soil_type alongside sunlight, demonstrating that when given more detailed categories the decision tree incorporates soil_type into its model.
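To put a number on this rather than eyeballing two plots, here is a sketch (my own addition, not from the text) that repeats the simulation many times and counts how often the root split uses soil_type at each granularity; with a more granular soil map the fraction should rise. The helper run_bias_check is hypothetical.

# Count how often the root split uses soil_type at a given granularity
run_bias_check <- function(reps, observations, granularity) {
  root_is_soil <- replicate(reps, {
    soil_type <- round(runif(observations, 0, 1) * granularity) / granularity
    sunlight  <- rnorm(observations)
    Y <- soil_type + sunlight + rnorm(observations, sd = 0.5) > 1.5
    fit <- rpart(Y ~ soil_type + sunlight,
                 data = data.frame(soil_type, sunlight, Y), method = "class")
    # The first row of the frame is the root node; check which variable it splits on
    as.character(fit$frame$var[1]) == "soil_type"
  })
  mean(root_is_soil)
}
run_bias_check(100, 200, 10)   # coarse soil map
run_bias_check(100, 200, 100)  # detailed soil map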
8.3 Gradient Boosting
In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
This has to do mostly with the bagging fraction. The left-hand model uses a lower fraction, so each tree is trained on a small subset of the data; as a result, the ensemble has to draw on a broader range of predictors to make accurate predictions across those subsets. On the right, each tree sees most of the data, so the model can more reliably settle on the same few strong predictors. The higher learning rate on the right pushes in the same direction: each tree’s contribution is large, so the model commits quickly to the strongest predictors rather than slowly adding smaller corrections from the others.
After having gone through all this, I only wonder which model is more accurate. I suppose this is another lever we have at our disposal.
Which model do you think would be more predictive of other samples?
According to what I read, the one on the left. It learns slowly and spreads its learning across more predictors, which should generalize better. The one on the right is fit more tightly to that specific dataset.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
I think it would mostly affect the one on the left. The one on the right, with both parameters at 0.9, is already close to overfit and barely leverages the other features, so its steep importance slope has little room to change. The model on the left has more room to be affected: deeper interactions would pull additional predictors into the splits and likely change its importance slope further.
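To make that concrete, here is a hedged sketch of how I would reproduce the two extremes from Fig. 8.24 on the solubility data with gbm. The solTrainXtrans and solTrainY objects come from data(solubility) in AppliedPredictiveModeling; gbm is an extra package, and n.trees and interaction.depth here are my own guesses, not the book’s exact settings.

library(gbm)
data(solubility)  # loads solTrainXtrans, solTrainY, and friends

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand settings: bagging fraction and learning rate both 0.1
gbmLeft <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3,
               shrinkage = 0.1, bag.fraction = 0.1)

# Right-hand settings: both parameters pushed to 0.9
gbmRight <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence; the right-hand fit should concentrate on fewer predictors
head(summary(gbmLeft, plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)

Re-running either fit with a larger interaction.depth is also the quickest way to check the last question directly.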
8.7 More On Chemical Manufacturing
Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.
I’ll pull in the models from the last several exercises (KNN, MARS, SVM) for comparison, and I’ll add the tree-based ones: random forest and XGBoost.
imputed_data <- preProcess(ChemicalManufacturingProcess, method = 'medianImpute')
processed_data <- predict(imputed_data, ChemicalManufacturingProcess)

set.seed(123)
splitIndex <- createDataPartition(processed_data$Yield, p = 0.80, list = FALSE)
train_data <- processed_data[splitIndex, ]
test_data <- processed_data[-splitIndex, ]
# KNN
knnModel <- train(Yield ~ ., data = train_data,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 5)
knnPred <- predict(knnModel, newdata = test_data)
knnMetrics <- postResample(pred = knnPred, obs = test_data$Yield)

# MARS
marsModel <- train(Yield ~ ., data = train_data,
                   method = "earth",
                   preProcess = c("center", "scale"),
                   tuneLength = 5)
marsPred <- predict(marsModel, newdata = test_data)
marsMetrics <- postResample(pred = marsPred, obs = test_data$Yield)

# SVM
svmModel <- train(Yield ~ ., data = train_data,
                  method = "svmRadial",
                  preProcess = c("center", "scale"),
                  tuneLength = 5)
svmPred <- predict(svmModel, newdata = test_data)
svmMetrics <- postResample(pred = svmPred, obs = test_data$Yield)

# Random Forest
rfModel <- train(Yield ~ ., data = train_data,
                 method = "rf",
                 preProcess = c("center", "scale"),
                 tuneLength = 5)
rfPred <- predict(rfModel, newdata = test_data)
rfMetrics <- postResample(pred = rfPred, obs = test_data$Yield)
# XGBoost - isn't working through caret, so use the xgboost package directly
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror",
  eta = 0.3,
  max_depth = 6
)
dtrain <- xgb.DMatrix(data = as.matrix(train_data[-which(names(train_data) == "Yield")]),
                      label = train_data$Yield)

# Train the model
xgbModel <- xgb.train(params = params, data = dtrain, nrounds = 100)

# Predict
xgbPred <- predict(xgbModel, as.matrix(test_data[-which(names(test_data) == "Yield")]))
xgbMetrics <- postResample(pred = xgbPred, obs = test_data$Yield)
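For reference, this is roughly the caret route I would normally try before falling back to the direct API: method = "xgbTree" with a small hand-picked grid. The grid values are my own assumption, not a tuned choice, and this is the path that was misbehaving in my session.

xgbGrid <- expand.grid(nrounds = c(50, 100),
                       max_depth = c(3, 6),
                       eta = c(0.1, 0.3),
                       gamma = 0,
                       colsample_bytree = 1,
                       min_child_weight = 1,
                       subsample = 1)
xgbCaretModel <- train(Yield ~ ., data = train_data,
                       method = "xgbTree",
                       tuneGrid = xgbGrid,
                       trControl = trainControl(method = "cv", number = 5))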
# Combine model performance metrics
modelPerformance <- data.frame(
  RMSE = c(knnMetrics[1], marsMetrics[1], svmMetrics[1], rfMetrics[1], xgbMetrics[1]),
  Rsquared = c(knnMetrics[2], marsMetrics[2], svmMetrics[2], rfMetrics[2], xgbMetrics[2]),
  MAE = c(knnMetrics[3], marsMetrics[3], svmMetrics[3], rfMetrics[3], xgbMetrics[3]),
  row.names = c("KNN", "MARS", "SVM", "RF", "XGBoost")
)
modelPerformance
RMSE Rsquared MAE
KNN 1.406333 0.4224466 1.1606875
MARS 1.402311 0.4244340 1.1104190
SVM 1.228647 0.5562600 1.0257624
RF 1.277703 0.5282280 0.9720912
XGBoost 1.028613 0.6809667 0.7879700
And the winner is
Which tree-based regression model gives the optimal resampling and test set performance?
Wow, XGBoost with the big win! It posts the lowest test-set RMSE and MAE and the highest R² of the bunch.
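Since the question also asks about resampling performance, here is a quick sketch comparing the resampled results of the caret-trained models with resamples(). Two caveats: the XGBoost fit above bypassed caret, so it can’t be included, and because the models were trained with default trainControl settings rather than shared resampling indices, this is only an approximate comparison.

# Compare resampling performance of the caret-trained models
resampleResults <- resamples(list(KNN = knnModel, MARS = marsModel,
                                  SVM = svmModel, RF = rfModel))
summary(resampleResults)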
Predictors
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The top variables are mostly the same across models, though the ordering shifts a bit, and you can see small differences in which features each model leans on. In both the XGBoost and random forest lists the process variables outnumber the biological ones, with ManufacturingProcess32 at or near the top.
# XGBoost
importance_matrix <- xgb.importance(feature_names = colnames(dtrain)[!colnames(dtrain) %in% "Yield"],
                                    model = xgbModel)
sorted_importance_matrix <- importance_matrix[order(-importance_matrix$Gain), ]
important_features <- head(sorted_importance_matrix, 10)
print(important_features)
Feature Gain Cover Frequency
1: ManufacturingProcess32 0.13697775 0.022868585 0.01449275
2: BiologicalMaterial03 0.10386130 0.027716303 0.01630435
3: ManufacturingProcess31 0.09515613 0.006512804 0.00634058
4: ManufacturingProcess17 0.09349432 0.021245653 0.01539855
5: ManufacturingProcess13 0.07765902 0.023711666 0.01811594
6: BiologicalMaterial09 0.05815033 0.040805143 0.02898551
7: BiologicalMaterial12 0.04474019 0.018041943 0.01086957
8: BiologicalMaterial02 0.04312171 0.130677627 0.07065217
9: ManufacturingProcess43 0.03886919 0.015554853 0.02536232
10: ManufacturingProcess22 0.02905526 0.029149542 0.02355072
varImp(rfModel)$importance |>
arrange(desc(Overall)) |>
head(10)
Overall
ManufacturingProcess32 100.00000
BiologicalMaterial06 19.79845
ManufacturingProcess31 18.65990
BiologicalMaterial03 18.50846
ManufacturingProcess17 16.30797
BiologicalMaterial12 13.62271
ManufacturingProcess28 11.15292
ManufacturingProcess13 10.96805
ManufacturingProcess36 10.95055
ManufacturingProcess09 10.82006
Tree Plots
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Two things stand out. First, there are only two trees in which ManufacturingProcess32 is the root node: one uses just that single split, and the other uses several variables. I figured there would have been a lot more trees with it at the root.
Second, BiologicalMaterial03 and ManufacturingProcess31, which have the second- and third-highest gain, don’t appear anywhere in that second ManufacturingProcess32-rooted tree. Boosting builds each tree on what the previous trees missed, so this isn’t impossible, but I expected them to show up there.
The exact behavior of this model is still a little confusing to me, so I widened the filter to trees that split on ManufacturingProcess32 at node 0, 1, or 2. The result is the same: those other important features still aren’t listed.
trees_data <- xgb.model.dt.tree(feature_names = colnames(dtrain), model = xgbModel)
trees_with_ManufacturingProcess32 <- unique(trees_data[Feature == 'ManufacturingProcess32' &
                                                         (Node == 0 | Node == 1 | Node == 2), "Tree"])
tree_numbers <- trees_with_ManufacturingProcess32$Tree
xgb.plot.tree(model = xgbModel, trees = tree_numbers)
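For the “single tree with the distribution of yield in the terminal nodes” part of the question, here is a hedged sketch of the rpart/partykit view I would use alongside the XGBoost plots; partykit is an extra package not loaded at the top of these notes, and the tree is left at rpart’s defaults rather than tuned.

library(partykit)

# A single regression tree on the training data; plotting the party object
# shows a boxplot of Yield in each terminal node
singleTree <- rpart(Yield ~ ., data = train_data)
plot(as.party(singleTree))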