Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
library(dplyr) # needed for the %>% pipe and arrange() used below
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1 %>%
arrange(Overall) %>%
knitr::kable()
|  | Overall |
|---|---|
| V8 | -0.1663626 |
| V9 | -0.0952927 |
| V10 | -0.0749448 |
| V7 | -0.0059617 |
| V6 | 0.1651112 |
| V3 | 0.7635918 |
| V5 | 2.0235246 |
| V2 | 6.4153694 |
| V4 | 7.6151188 |
| V1 | 8.7322354 |
No, predictors V6–V10 were not significantly used in the random forest model; their importance scores are all essentially zero. The informative predictors were used, in decreasing order of importance: V1, V4, V2, V5, and V3.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>%
arrange(Overall) %>%
knitr::kable()
|  | Overall |
|---|---|
| V8 | -0.0437056 |
| V7 | -0.0134564 |
| V9 | 0.0084044 |
| V10 | 0.0289481 |
| V6 | 0.1356906 |
| V3 | 0.6297022 |
| V5 | 1.8723844 |
| duplicate1 | 4.2833158 |
| V1 | 5.6911997 |
| V2 | 6.0689606 |
| V4 | 7.0475224 |
V1 is no longer the most important predictor: its importance score dropped from about 8.7 to 5.7, leaving it third (followed closely by its duplicate), and V4 is now the most important.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .12
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9173613
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 %>%
arrange(Overall) %>%
knitr::kable()
|  | Overall |
|---|---|
| V8 | -0.0866595 |
| V7 | -0.0205298 |
| V9 | -0.0076255 |
| V10 | 0.0086769 |
| V6 | 0.1401401 |
| V3 | 0.4965381 |
| duplicate2 | 1.3234148 |
| V5 | 1.8475705 |
| duplicate1 | 3.6753624 |
| V1 | 5.1739861 |
| V2 | 6.3914741 |
| V4 | 7.3295359 |
After adding yet another predictor that is highly correlated with V1, V1's importance score decreased even further (to about 5.2), but its rank among the other original predictors remained the same.
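To see this numerically, here is a small sketch that tracks V1's raw importance across the three fits (it just reuses the rfImp1, rfImp2, and rfImp3 objects created above):
# Tracking how V1's importance erodes as correlated copies are added
c(no_duplicates  = rfImp1["V1", "Overall"],
  one_duplicate  = rfImp2["V1", "Overall"],
  two_duplicates = rfImp3["V1", "Overall"])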
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Loading party (for cforest and varimp) and setting the seed
library(party)
set.seed(1298)
# Fitting the conditional inference forest model
# I'm using cforest_unbiased to align with standard random forest parameters
cforest_model <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 1000))
# Calculating traditional importance
cforest_imp_trad <- varimp(cforest_model, conditional = FALSE)
# Calculating conditional importance
cforest_imp_cond <- varimp(cforest_model, conditional = TRUE)
# Printing these out so I can compare the patterns
print("Traditional Importance:")
## [1] "Traditional Importance:"
print(sort(cforest_imp_trad, decreasing = TRUE))
## V4 V2 duplicate1 V1 V5 duplicate2
## 7.217851407 6.031053141 4.528447993 4.393893588 1.631702309 0.856997316
## V7 V9 V3 V8 V10 V6
## 0.025799004 -0.002332035 -0.013544465 -0.013611289 -0.030773410 -0.034665044
print("Conditional Importance:")
## [1] "Conditional Importance:"
print(sort(cforest_imp_cond, decreasing = TRUE))
## V4 V2 V1 duplicate1 V5 duplicate2
## 5.637497021 4.862625987 1.669282212 1.540558120 0.983381446 0.157480139
## V6 V9 V3 V7 V8 V10
## 0.009103388 0.007479687 0.001722000 -0.001547496 -0.009069889 -0.013958336
Looking at the outputs, I can see that the importances do not show the exact same pattern when I switch between traditional and conditional measures. Under the traditional measure, duplicate1 actually scores higher than the original V1 variable, which shows exactly how standard random forests can be biased by highly correlated predictors splitting the importance. However, when I look at the conditional importance (using the Strobl et al. adjustment), V1 regains its spot above duplicate1. The conditional measure does a much better job of penalizing the duplicated information and revealing the true underlying importance, even though the absolute importance values drop across the board.
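To make those rank shifts easier to see, here is a small sketch that lines the two measures up side by side (it assumes the cforest_imp_trad and cforest_imp_cond vectors from above):
# Comparing ranks under the traditional and conditional importance measures
imp_compare <- data.frame(
  traditional = cforest_imp_trad,
  conditional = cforest_imp_cond[names(cforest_imp_trad)]
)
imp_compare$rank_trad <- rank(-imp_compare$traditional)
imp_compare$rank_cond <- rank(-imp_compare$conditional)
knitr::kable(imp_compare[order(imp_compare$rank_cond), ], digits = 3)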
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Fitting a Stochastic Gradient Boosting (GBM) model
set.seed(1298)
gbm_model <- train(y ~ ., data = simulated, method = "gbm", verbose = FALSE)
gbm_imp <- varImp(gbm_model)
print("Boosted Trees Importance:")
## [1] "Boosted Trees Importance:"
print(gbm_imp)
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 72.9056
## duplicate1 54.0705
## V5 32.9736
## V1 27.1809
## V3 25.9778
## V7 2.9594
## V9 1.8584
## V6 1.4399
## V8 1.0059
## duplicate2 0.7965
## V10 0.0000
# Fitting a Cubist model
set.seed(1298)
cubist_model <- train(y ~ ., data = simulated, method = "cubist")
cubist_imp <- varImp(cubist_model)
print("Cubist Importance:")
## [1] "Cubist Importance:"
print(cubist_imp)
## cubist variable importance
##
## Overall
## V2 100.00
## V4 76.30
## V5 65.93
## V1 62.22
## V3 42.96
## duplicate1 37.04
## duplicate2 22.96
## V6 13.33
## V10 0.00
## V9 0.00
## V7 0.00
## V8 0.00
Looking at the boosted trees (GBM) output, I can see that the correlation bias pattern definitely still occurs, and it is arguably even more pronounced. In this model, duplicate1 is ranked higher than the original V1 variable. The GBM model clearly struggled with the highly correlated predictors, and I think it arbitrarily favored the duplicate over the original.
On the flip side, the Cubist model handled things a bit differently. V1 managed to stay ahead of duplicate1 and duplicate2, but I can still see that the duplicates are bleeding away some of the importance that would otherwise belong solely to V1 (V1 is no longer a top-ranked predictor the way it was in the original random forest fit before the duplicates were added). So, while Cubist is slightly more robust here than GBM or a standard random forest, the overall pattern I expected, where correlated predictors muddy the waters and dilute each other's importance, holds for each of these models.
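One way to quantify that "bleeding" is to add up the importance credited to V1 and its duplicates within each model. This is a rough sketch only: each total is on its own model's importance scale (raw permutation importance for the random forest, caret's 0-100 scaling for GBM and Cubist), so the values aren't comparable across models; the point is simply to see the family's combined credit in each model's own units.
# Total importance credited to the V1 "family" within each model
v1_family <- c("V1", "duplicate1", "duplicate2")
c(RandomForest = sum(rfImp3[v1_family, "Overall"]),
  GBM          = sum(gbm_imp$importance[v1_family, "Overall"]),
  Cubist       = sum(cubist_imp$importance[v1_family, "Overall"]))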
Use a simulation to show tree bias with different granularities.
set.seed(1298)
# Simulating random target variable
y <- rnorm(1000)
# Simulating a continuous predictor (high granularity - lots of unique values)
high_gran_x <- rnorm(1000)
# Simulating a discrete predictor (low granularity - only two unique values)
low_gran_x <- as.factor(sample(0:1, 1000, replace = TRUE))
# Binding them into a single dataframe
sim_data <- data.frame(y, high_gran_x, low_gran_x)
# Fitting a standard random forest model
rf_bias_model <- randomForest(y ~ ., data = sim_data, importance = TRUE)
# Extracting and printing the variable importance
print(varImp(rf_bias_model, scale = FALSE))
## Overall
## high_gran_x -0.01487171
## low_gran_x -0.01082608
# Pulling the node impurity (Type 2 importance) to expose the tree-building bias
gini_imp <- randomForest::importance(rf_bias_model, type = 2)
print(gini_imp)
## IncNodePurity
## high_gran_x 61.574364
## low_gran_x 3.745853
Looking at the node impurity (Gini importance) output, the tree-building bias is pretty obvious. Even though both of these predictors are completely random noise with no real relationship to y, the random forest algorithm gave the continuous high_gran_x a massive importance score of over 61, while the binary low_gran_x got a 3.7.
I think the algorithm was tricked by the continuous variable simply because it had 999 different possible split points to test. That gave the tree 999 chances to find a randomly “lucky” split that slightly reduced the training error at a given node. The binary variable only had one possible split, so it didn’t get the same statistical advantage. This simulation perfectly demonstrates why I have to be extremely careful using standard tree-based impurity metrics when my predictors have wildly different numbers of unique values (granularity)—the algorithm will artificially inflate the importance of the more granular variables.
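To push on this a bit further, here is a sketch of my own (using single rpart trees rather than a full forest, which is an assumption on my part) that repeats the simulation and records which predictor gets chosen for the very first split. Even though neither predictor is related to y, the high-granularity variable should win the root split far more often than chance.
# Repeating the noise-only simulation and tallying the root-split variable
library(rpart)
set.seed(1298)
first_split <- replicate(200, {
  sim <- data.frame(y = rnorm(100),
                    high_gran_x = rnorm(100),
                    low_gran_x  = factor(sample(0:1, 100, replace = TRUE)))
  fit <- rpart(y ~ ., data = sim,
               control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])  # variable used at the root split
})
table(first_split)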
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Looking at the two graphs in Figure 8.24, I can see a shift in how the model distributes “credit” across the predictors. The model on the right is much more “greedy” because it uses a high bagging fraction (0.9) and a high learning rate (0.9). By using almost all the data in every step and taking huge leaps to reduce error, the model quickly locks onto the most dominant predictors like NumCarbon and MolWeight. But then after those have explained the bulk of the variance, there is very little residual error left for the other variables to work with, causing their importance to drop off.
On the other hand, the model on the left is forced to be much more diverse. With a small bagging fraction (0.1), each tree only sees a tiny random slice of the data, and the low learning rate (0.1) ensures that no single tree can make a massive change to the final prediction. This “slow and steady” approach prevents the dominant variables from taking over immediately, forcing the ensemble to discover and utilize the subtle signals from a much wider variety of predictors.
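To check this intuition outside the book's figure, here is a hedged sketch. It assumes the solubility data from the AppliedPredictiveModeling package (solTrainXtrans / solTrainY) and fits gbm directly with 100 trees and depth 1, which are my own placeholder settings, not the ones used to make Figure 8.24. The idea is simply to compare how concentrated the importance is in the top five predictors under the two extreme settings.
# Refitting the two extreme bagging-fraction / learning-rate combinations
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(1298)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(1298)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)
imp_low  <- summary(gbm_low,  plotit = FALSE)
imp_high <- summary(gbm_high, plotit = FALSE)
# Share of total relative influence captured by the top 5 predictors
c(low_settings  = sum(head(imp_low$rel.inf,  5)) / sum(imp_low$rel.inf),
  high_settings = sum(head(imp_high$rel.inf, 5)) / sum(imp_high$rel.inf))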
Which model do you think would be more predictive of other samples?
I think the left would be the more predictive one for other samples. The model on the right looks like overfitting; by using such a high learning rate and bagging fraction, it basically “memorized” the strongest patterns in that specific training set without leaving room for nuances. It’s too aggressive. The model on the left looks like a more robust learner. By using small, incremental updates and forced randomness through a lower bagging fraction, it should be a better generalizer. It isn’t over-reliant on any single predictor, so when it encounters new data with slightly different noise, it’s less likely to fall apart than the brittle model on the right.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
If I were to increase the interaction depth of these models, I would expect the steep importance slopes to flatten out quite a bit. Interaction depth dictates how many variables can be involved in a single tree’s decision path; by allowing trees to grow deeper, I am essentially forcing the algorithm to look beyond the most dominant predictors at the root and incorporate a wider pool of secondary features. Instead of the model “greedily” spending its entire learning budget on one or two heavy hitters like MolWeight, deeper trees would be required to evaluate more complex multi-variable interactions, naturally distributing importance scores more broadly across the entire feature set.
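Continuing the sketch above (same assumptions and placeholder settings), one way to probe this would be to hold the 0.9/0.9 combination fixed, raise interaction.depth, and check whether the top-five share of importance shrinks as the deeper trees pull in secondary predictors:
set.seed(1298)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.9, bag.fraction = 0.9)
imp_deep <- summary(gbm_deep, plotit = FALSE)
# If the reasoning holds, this share should come in below high_settings above
sum(head(imp_deep$rel.inf, 5)) / sum(imp_deep$rel.inf)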
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
# Loading the chemical manufacturing data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Setting seed for reproducibility
set.seed(1298)
# Creating an 80/20 train/test split consistent with my previous work
train_indices <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train_data <- ChemicalManufacturingProcess[train_indices, ]
test_data <- ChemicalManufacturingProcess[-train_indices, ]
# Imputing missing values using KNN
pre_process <- preProcess(train_data[, -1], method = "knnImpute")
# Applying the imputation and separating predictors from the target
train_x <- predict(pre_process, train_data[, -1])
train_y <- train_data$Yield
test_x <- predict(pre_process, test_data[, -1])
test_y <- test_data$Yield
# Setting up cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Model 1: Random Forest
set.seed(1298)
rf_fit <- train(x = train_x, y = train_y, method = "rf",
tuneLength = 10, trControl = ctrl)
# Model 2: Stochastic Gradient Boosting (GBM)
set.seed(1298)
gbm_fit <- train(x = train_x, y = train_y, method = "gbm",
tuneLength = 10, trControl = ctrl, verbose = FALSE)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
# Model 3: Cubist
set.seed(1298)
cubist_fit <- train(x = train_x, y = train_y, method = "cubist",
tuneLength = 10, trControl = ctrl)
# Comparing the resampling results
model_comparison <- resamples(list(RF = rf_fit, GBM = gbm_fit, Cubist = cubist_fit))
summary(model_comparison)
##
## Call:
## summary.resamples(object = model_comparison)
##
## Models: RF, GBM, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4032321 0.6754317 0.7872979 0.7891359 0.8858019 1.2033044 0
## GBM 0.5040326 0.6738755 0.7345773 0.8161674 0.9690085 1.2786984 0
## Cubist 0.4998705 0.6064810 0.7516544 0.7314775 0.8291642 0.9768175 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5699800 0.8599910 1.0297749 1.0364859 1.217620 1.466586 0
## GBM 0.6247537 0.8743345 1.0818618 1.0701992 1.216204 1.509482 0
## Cubist 0.6024852 0.8268582 0.9604504 0.9465471 1.037470 1.368593 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4064170 0.6801783 0.7217268 0.7015773 0.7784786 0.8480853 0
## GBM 0.4743066 0.6016696 0.6701852 0.6713285 0.7733403 0.8347773 0
## Cubist 0.5589029 0.7079809 0.7576645 0.7430085 0.8122929 0.8409966 0
# Evaluating performance on the test set for the best looking model
# I'll check all three to be sure
rf_pred <- predict(rf_fit, test_x)
gbm_pred <- predict(gbm_fit, test_x)
cubist_pred <- predict(cubist_fit, test_x)
postResample(rf_pred, test_y)
## RMSE Rsquared MAE
## 1.4269929 0.5861494 1.1635706
postResample(gbm_pred, test_y)
## RMSE Rsquared MAE
## 1.2696736 0.6657949 1.0118451
postResample(cubist_pred, test_y)
## RMSE Rsquared MAE
## 1.3988727 0.5496768 1.0896740
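To make the comparison easier to scan, here is a small consolidation of those test-set metrics into one table (it just reuses the prediction objects above):
# Collecting the test-set metrics for all three models
test_results <- rbind(RF     = postResample(rf_pred, test_y),
                      GBM    = postResample(gbm_pred, test_y),
                      Cubist = postResample(cubist_pred, test_y))
knitr::kable(round(test_results, 4))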
After comparing the three tree-based models, I found that the Cubist model performed the best during the resampling phase, boasting the lowest mean RMSE (0.9465) and the highest mean R-squared (0.7430). However, when I applied these models to my unseen test data, the Stochastic Gradient Boosting (GBM) model actually took the lead. The GBM model achieved a test RMSE of 1.2697 and an R-squared of 0.6658, outperforming both the Random Forest and the Cubist models on those same samples. Even though Cubist looked slightly stronger in-sample, the GBM model demonstrated better generalization in this specific run, making it my choice for the optimal model for this process.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
# Pulling the variable importance for my optimal GBM model
gbm_imp_87 <- varImp(gbm_fit, scale = FALSE)
# Plotting the top 20 to get a good look at the leaders
plot(gbm_imp_87, top = 20, main = "Top Predictors for GBM Model")
# Printing the top 10 specifically to compare with my previous homeworks
print(gbm_imp_87)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 489.31
## ManufacturingProcess13 119.24
## ManufacturingProcess17 118.84
## BiologicalMaterial06 116.27
## ManufacturingProcess09 95.87
## BiologicalMaterial11 91.48
## BiologicalMaterial04 56.47
## BiologicalMaterial03 49.36
## BiologicalMaterial05 42.65
## ManufacturingProcess15 36.13
## BiologicalMaterial12 34.22
## BiologicalMaterial09 32.16
## ManufacturingProcess05 31.10
## ManufacturingProcess11 28.09
## ManufacturingProcess06 25.12
## ManufacturingProcess22 24.51
## ManufacturingProcess24 24.29
## ManufacturingProcess04 24.18
## ManufacturingProcess39 20.87
## ManufacturingProcess19 20.30
In my optimal GBM model, the top ten predictors are a mix of both biological and manufacturing variables, though the manufacturing processes have a stronger presence at the very top. Specifically, ManufacturingProcess32 is the undisputed leader, with an importance score that towers over everything else, followed by ManufacturingProcess13 and ManufacturingProcess17. Out of the top ten, five are manufacturing processes and five are biological materials.
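A quick tally confirms that split (a small sketch using the gbm_imp_87 object from above):
# Counting biological vs. process variables among the GBM top 10
top10 <- rownames(gbm_imp_87$importance)[order(gbm_imp_87$importance$Overall,
                                               decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))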
When I compare these to the top ten from my previous linear and nonlinear models, some familiar faces remain, like ManufacturingProcess32 and BiologicalMaterial06. However, the tree-based model has elevated a few process variables that weren't as prominent before, like ManufacturingProcess13 and ManufacturingProcess17. It's interesting that while the top-ten count here is an even split between biological and process variables (much like the roughly 50/50 split my linear models gave), this tree-based approach concentrates far more of its raw importance in a few specific manufacturing steps, so the "Process" side still dominates where it matters most.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
# Loading rpart and rpart.plot, then fitting a single regression tree
# to visualize the top-level logic
library(rpart)
library(rpart.plot)
set.seed(1298)
single_tree <- rpart(Yield ~ ., data = train_data, control = rpart.control(cp = 0.01))
# Plotting the tree with yield distributions at the terminal nodes
rpart.plot(single_tree,
type = 4,
extra = 101,
under = TRUE,
cex = 0.8,
box.palette = "GnBu",
main = "Single Regression Tree for Chemical Yield")
Visualizing the single regression tree gives me a much clearer window into the decision logic that drives chemical yield. The root node of my tree is ManufacturingProcess32 (with a split at 158.5), which confirms its massive importance score from the earlier models; it is essentially the “primary filter” for the entire process. Once that first manufacturing split is made, the tree immediately branches into BiologicalMaterial11 and BiologicalMaterial03.
This structure tells me that while manufacturing processes set the overall baseline for the yield, biological materials act as the critical secondary factors that fine-tune the final result within those manufacturing constraints. Out of the ten terminal nodes I can see in the text hierarchy, the majority are determined by manufacturing steps (like ManufacturingProcess09, 17, and 39), but they are often nested within a biological branch. This suggests a highly interactive relationship where the impact of a specific manufacturing process depends heavily on the quality of the biological input being used at that time.
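To back up that terminal-node reading, here is a small sketch summarizing the yield distribution within each leaf. It relies on single_tree and train_data from above and on the assumption that every training row reaches a leaf (Yield has no missing values, and rpart routes rows with missing predictors via surrogate splits, so $where should cover them all).
# Size, mean, and spread of Yield within each terminal node
leaf_of_row <- single_tree$where
aggregate(train_data$Yield, by = list(node = leaf_of_row),
          FUN = function(z) c(n = length(z),
                              mean_yield = round(mean(z), 2),
                              sd_yield   = round(sd(z), 2)))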