Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6– V10)?
print(rfImp1)
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
The column Overall shows the importance scores for each predictor. In this case, we can see that the informative predictors (V1 to V5) have much higher importance scores compared to the uninformative predictors (V6 to V10). This indicates that the random forest model did not significantly use the uninformative predictors.
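As an optional visual check (a small sketch, not part of the original output), randomForest's built-in importance plot shows the same separation between the informative predictors (V1 to V5) and the uninformative ones (V6 to V10):
varImpPlot(model1, main = "Random Forest Importance (V6-V10 near zero)")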
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
We used the code
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
to create a new predictor that is highly correlated with \(V_1\) but not identical to it, so that we can observe the effect on variable importance. We then refit the random forest and recalculate the variable importance scores.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9447637
set.seed(300)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 6.40030160
## V2 6.54081698
## V3 0.58055721
## V4 6.89455554
## V5 2.18646985
## V6 0.15925309
## V7 -0.02562171
## V8 -0.06348011
## V9 -0.02716102
## V10 -0.01571937
## duplicate1 3.60470237
From cor(simulated$duplicate1, simulated$V1) we can confirm the high correlation. The new predictor duplicate1 received an Overall importance score of 3.60, while the score for V1 dropped to 6.40 from its previous value of 8.73. This confirms that when a highly correlated variable (duplicate1) is introduced, it absorbs some of the importance that previously belonged to V1.
When two variables carry nearly the same predictive information (correlation of 0.946), they compete to be chosen at each split in the trees. The model splits the “credit” for the predictive power between them. V1 retained a higher score than duplicate1 (6.40 vs. 3.60) only because duplicate1 contained added noise, making V1 the slightly stronger signal. If we were to add a third correlated predictor (duplicate2), the importance scores for V1 and duplicate1 would decrease even further. The total predictive signal would be roughly the same, but the importance credit would now be split across three variables instead of two, making each individual variable appear less important than it truly is.
Random Forest variable importance scores can be misleading in the presence of high collinearity, as the importance is spread out among the correlated group rather than assigned to a single representative variable.
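As a quick, hedged check of the claim about a third correlated predictor, the sketch below (not part of the original analysis) adds the duplicate2 variable mentioned above, refits the forest, and then removes the extra column so the later chunks still use the same data as the output shown above.
set.seed(301)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1   # second noisy copy of V1
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Expectation (hedged): the importance credit is now shared three ways,
# so V1 and duplicate1 should both drop further.
varImp(model3, scale = FALSE)[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
simulated$duplicate2 <- NULL   # keep later chunks consistent with the results above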
In this part of the exercise we will fit a random forest model using conditional inference trees from the party package and calculate the variable importance using the varimp function with the conditional argument.
imp_unconditional: This is similar to the standard Random Forest importance
imp_conditional: This adjusts for correlations between predictors
library(party)
set.seed(400)
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(mtry = 3, ntree = 1000))
imp_unconditional <- varimp(cforest_model, conditional = FALSE)
imp_conditional <- varimp(cforest_model, conditional = TRUE)
cat("\n--- Unconditional Importance (V1 vs duplicate1) ---\n")
##
## --- Unconditional Importance (V1 vs duplicate1) ---
print(imp_unconditional[c("V1", "duplicate1")])
## V1 duplicate1
## 5.791714 2.946044
cat("\n--- Conditional Importance (V1 vs duplicate1) ---\n")
##
## --- Conditional Importance (V1 vs duplicate1) ---
print(imp_conditional[c("V1", "duplicate1")])
## V1 duplicate1
## 2.494006 1.353788
The unconditional importance scores show a pattern similar to the traditional random forest model: both V1 and duplicate1 receive sizable scores, with V1 higher, because the importance is split between them due to their high correlation. The conditional importance scores, which assess each predictor while conditioning on the variables it is correlated with, shrink both values (V1 falls to about 2.49 and duplicate1 to about 1.35). This indicates that a large share of each variable's unconditional importance was shared rather than unique, while V1 still retains the higher score and is identified as the primary informative variable. In this sense, conditional inference forests give a clearer picture of a predictor's unique contribution when multicollinearity is present.
I will fit a boosted tree model with the gbm package and a Cubist model with the Cubist package to the same data and calculate variable importance for each.
library(caret)
library(gbm)
set.seed(200)
gbm_model <- train(y ~ ., data = simulated,
                   method = "gbm",
                   verbose = FALSE)
gbm_imp <- varImp(gbm_model, scale = FALSE)
library(Cubist)
set.seed(200)
cubist_model <- train(y ~ ., data = simulated,
                      method = "cubist")
cubist_imp <- varImp(cubist_model, scale = FALSE)
After fitting both models, we can examine the variable importance scores for V1 and duplicate1.
cat("\n--- Boosted Trees Importance (V1 vs duplicate1) ---\n")
##
## --- Boosted Trees Importance (V1 vs duplicate1) ---
print(gbm_imp$importance[c("V1", "duplicate1"), , drop = FALSE])
## Overall
## V1 3025.34
## duplicate1 1250.66
cat("\n--- Cubist Importance (V1 vs duplicate1) ---\n")
##
## --- Cubist Importance (V1 vs duplicate1) ---
print(cubist_imp$importance[c("V1", "duplicate1"), , drop = FALSE])
## Overall
## V1 70.0
## duplicate1 8.5
Random Forest is sensitive to collinearity and splits importance among correlated variables. Boosted trees are less affected: GBM still assigns some importance to duplicate1 (about 1251 versus 3025 for V1), but the split is far less even than in the random forest. Cubist is the most robust, essentially ignoring the noisier copy (8.5 versus 70.0 for V1). Unlike Random Forest, which forces trees to split on random subsets of features, GBM and Cubist use greedy searches that repeatedly select the single variable that most reduces the error at each step, so the slightly noisier duplicate is rarely chosen.
Use a simulation to show tree bias with different granularities.
To answer this part, I will simulate data where the response variable is pure random noise and there are two predictors: one with high granularity (continuous) and one with low granularity (binary). I will fit a decision tree to these data 100 times and record which predictor is chosen for the primary split. This illustrates the bias of decision trees toward predictors with more distinct values.
rnorm(n): generates n random numbers from a standard normal distribution (mean = 0, sd = 1).
x_high: a continuous predictor with high granularity
x_low: a binary predictor with low granularity (0 or 1)
rpart(y ~ ., data = df, control = rpart.control(maxdepth = 1, cp = -1)): fits a decision tree to predict y using both predictors, forcing the tree to make at least one split.
library(rpart)

high_granularity_wins <- 0
low_granularity_wins <- 0
no_split <- 0

set.seed(55)
for (i in 1:100) {
  n <- 200
  # Response and both predictors are pure noise: neither predictor is related to y
  y <- rnorm(n)
  x_high <- rnorm(n)                        # continuous predictor: high granularity
  x_low <- sample(0:1, n, replace = TRUE)   # binary predictor: low granularity
  df <- data.frame(y, x_high, x_low)

  # maxdepth = 1 allows a single split; cp = -1 forces a split to be made,
  # so no_split remains 0 in this simulation
  model <- rpart(y ~ ., data = df, control = rpart.control(maxdepth = 1, cp = -1))

  frame <- model$frame
  if (nrow(frame) > 1) {
    split_var <- as.character(frame$var[1])
    if (split_var == "x_high") {
      high_granularity_wins <- high_granularity_wins + 1
    } else {
      low_granularity_wins <- low_granularity_wins + 1
    }
  } else {
    no_split <- no_split + 1
  }
}

results <- data.frame(
  Predictor_Type = c("High Granularity (Continuous)", "Low Granularity (Binary)"),
  Times_Selected = c(high_granularity_wins, low_granularity_wins)
)
print(results)
## Predictor_Type Times_Selected
## 1 High Granularity (Continuous) 92
## 2 Low Granularity (Binary) 8
To demonstrate the inherent selection bias in regression trees, a simulation was conducted using a response variable and two predictors generated entirely from random noise. One predictor was high-granularity (continuous), and the other was low-granularity (binary). Even though both predictors contained only random noise and had zero true relationship with the response variable (\(Y\)), the regression tree chose the High Granularity (Continuous) variable 92% of the time.
This simulation confirms that when predictors vary widely in their number of unique values, standard variable importance measures in single trees can be misleading, as they are heavily biased toward predictors with higher cardinality.
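The mechanism behind this bias can be made concrete with a short sketch (illustrative only): a continuous predictor offers far more candidate cut points than a binary one, so even pure noise has many more chances to produce a seemingly good split.
set.seed(55)
x_high <- rnorm(200)                        # continuous noise predictor
x_low <- sample(0:1, 200, replace = TRUE)   # binary noise predictor
# rpart considers a cut point between each pair of adjacent unique values,
# so the continuous predictor gets ~199 candidate splits versus exactly 1.
c(candidate_splits_high = length(unique(x_high)) - 1,
  candidate_splits_low = length(unique(x_low)) - 1)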
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The model on the right (high bagging fraction and learning rate) concentrates importance on a small set of predictors because it fits the data aggressively. A large bagging fraction ensures that the strongest predictors are included in nearly every tree, while the high learning rate allows these predictors to make large corrections early in the boosting process. As a result, most of the error is eliminated by the dominant variables, leaving little opportunity for weaker predictors to accumulate importance.
In contrast, the model on the left (low bagging fraction and learning rate) distributes importance across a wider range of predictors. With a small subsample, the best predictors are often missing from individual trees, forcing the algorithm to rely on alternative or correlated variables. The low learning rate also slows down the learning process, requiring many more boosting iterations, which naturally spreads importance across more predictors.
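To make this concrete, here is a minimal sketch that mimics the Figure 8.24 setup on the solubility data from the AppliedPredictiveModeling package. The n.trees value and other settings are illustrative assumptions rather than the book's exact configuration, so the numbers will differ from the figure.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbm_left <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# The "right" model should concentrate relative influence on a few predictors,
# while the "left" model spreads it more evenly.
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)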
(b) Which model do you think would be more predictive of other samples?
The model with the lower learning rate and lower bagging fraction (the left model) would likely be more predictive of new samples. This model learns more slowly and incorporates more randomness, causing it to spread importance across more predictors. These characteristics help reduce overfitting and typically lead to better generalization performance on unseen data compared to the more aggressive model on the right.
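A rough way to check this expectation, reusing the two illustrative models from the sketch above and the held-out solubility test set (again, the settings are assumptions rather than the book's):
pred_left <- predict(gbm_left, newdata = solTestXtrans, n.trees = 100)
pred_right <- predict(gbm_right, newdata = solTestXtrans, n.trees = 100)
# Lower RMSE on the test set indicates better generalization.
c(RMSE_left = sqrt(mean((pred_left - solTestY)^2)),
  RMSE_right = sqrt(mean((pred_right - solTestY)^2)))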
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing interaction depth gives the model more opportunities to use additional variables in deeper splits. While the strongest predictors are still selected for the primary splits, deeper trees force the algorithm to select secondary variables to capture residuals and interactions in the lower branches. As a result, the overall importance slope usually becomes flatter, as the importance is distributed across a wider number of predictors rather than being concentrated solely on the top few.
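Continuing the same hedged sketch, refitting the aggressive ("right") model with a larger interaction.depth shows how relative influence spreads to additional predictors; the exact values depend on the illustrative settings assumed above.
set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
# Compare how quickly rel.inf falls off versus the depth-1 model above.
head(summary(gbm_deep, plotit = FALSE), 10)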
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
To answer this question, we will fit several tree-based regression models, including a single decision tree (CART), a random forest, stochastic gradient boosting, and Cubist. We will use the same data imputation, splitting, and preprocessing steps as in the previous exercises. The model with the best resampling performance (lowest cross-validated RMSE) on the training set and the best performance on the test set will be considered optimal.
library(caret)
library(AppliedPredictiveModeling)
library(dplyr)
library(tidyr)
library(rpart)
library(randomForest)
library(gbm)
library(Cubist)
data(ChemicalManufacturingProcess)
process_data <- as.data.frame(ChemicalManufacturingProcess)
# Median-impute any missing numeric values
process_data <- process_data |>
  mutate(across(where(is.numeric), ~tidyr::replace_na(.x, median(.x, na.rm = TRUE))))
set.seed(42)
inTrain <- createDataPartition(process_data$Yield, p = 0.8, list = FALSE)
train_df <- process_data[inTrain, ]
test_df <- process_data[-inTrain, ]
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
Now that the data is ready, we will train four models representing the evolution of tree-based methods:
Single Regression Tree (CART): A basic decision tree model that splits the data based on predictor values to minimize error.
Random Forest (RF): An ensemble of decision trees built on bootstrapped samples of the data, using random subsets of predictors at each split to reduce correlation among trees.
Stochastic Gradient Boosting (GBM): An ensemble method that builds trees sequentially, where each new tree focuses on correcting the errors of the previous trees.
Cubist (Model Tree): A rule based model that combines decision trees with linear regression models in the terminal nodes.
set.seed(42)
cartModel <- train(Yield ~ ., data = train_df,
                   method = "rpart",
                   tuneLength = 10,
                   trControl = control)

set.seed(42)
rfModel <- train(Yield ~ ., data = train_df,
                 method = "rf",
                 tuneLength = 5,
                 trControl = control)

set.seed(42)
gbmModel <- train(Yield ~ ., data = train_df,
                  method = "gbm",
                  verbose = FALSE,
                  tuneLength = 5,
                  trControl = control)

set.seed(42)
cubistModel <- train(Yield ~ ., data = train_df,
                     method = "cubist",
                     tuneLength = 5,
                     trControl = control)
The resampling results will be compiled to identify the model with the lowest RMSE.
models <- list(
  CART = cartModel,
  RF = rfModel,
  GBM = gbmModel,
  Cubist = cubistModel
)
resampling_results <- resamples(models)
summary(resampling_results, metric = "RMSE")
##
## Call:
## summary.resamples(object = resampling_results, metric = "RMSE")
##
## Models: CART, RF, GBM, Cubist
## Number of resamples: 30
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.8294915 1.2394079 1.4631173 1.442301 1.704645 1.818668 0
## RF 0.6853131 0.9682762 1.0821275 1.086752 1.212917 1.555958 0
## GBM 0.6881648 0.9589184 1.0955852 1.088682 1.214744 1.574065 0
## Cubist 0.5893627 0.8353868 0.9603153 1.008188 1.127322 2.140472 0
test_results <- data.frame(Model = character(), Test_RMSE = numeric())

for (name in names(models)) {
  pred <- predict(models[[name]], newdata = test_df)
  rmse <- sqrt(mean((pred - test_df$Yield)^2))
  test_results <- test_results |>
    bind_rows(data.frame(Model = name, Test_RMSE = rmse))
}
cat("\n--- Final Test Set Performance (Lowest RMSE is Best) ---\n")
##
## --- Final Test Set Performance (Lowest RMSE is Best) ---
test_results |> arrange(Test_RMSE)
## Model Test_RMSE
## 1 Cubist 0.9441946
## 2 GBM 1.1261202
## 3 RF 1.1737473
## 4 CART 1.5027045
Based on the results, the Cubist model gives the optimal performance, achieving the lowest Test Set RMSE (0.944) compared to GBM (1.126) and Random Forest (1.174).
Cubist is considered the best performer because it creates a "model tree" in which the leaves are not just single numbers but linear regression models. This hybrid approach allows it to capture non-linear structure (via the tree splits) and smooth, additive relationships (via the linear models in the leaves) simultaneously. It is particularly effective on this chemical data, where the relationship between process variables and yield likely involves smooth trends rather than just the step functions found in standard trees.
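To see these rule-level linear models directly, the fitted Cubist object can be inspected; summary() prints each rule's conditions together with the regression model used within that rule (the output can be lengthy, so only a portion is usually examined).
cubistModel$bestTune                 # tuned number of committees / neighbors
summary(cubistModel$finalModel)      # rules and the linear model attached to each rule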
library(caret)
library(Cubist)
cubist_imp <- varImp(cubistModel, scale = FALSE)
print(cubist_imp)
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 52.0
## ManufacturingProcess17 26.0
## ManufacturingProcess09 26.0
## ManufacturingProcess33 19.0
## BiologicalMaterial06 16.0
## ManufacturingProcess04 15.0
## ManufacturingProcess39 14.5
## ManufacturingProcess13 14.5
## ManufacturingProcess27 12.0
## BiologicalMaterial03 11.5
## ManufacturingProcess29 9.5
## ManufacturingProcess31 9.0
## ManufacturingProcess25 7.5
## BiologicalMaterial02 6.0
## BiologicalMaterial11 6.0
## ManufacturingProcess15 5.5
## ManufacturingProcess06 5.5
## ManufacturingProcess37 5.5
## ManufacturingProcess26 5.0
## ManufacturingProcess11 4.5
plot(cubist_imp, top = 20, main = "Variable Importance: Cubist Model")
In the optimal tree-based regression model (Cubist), the most important predictors are overwhelmingly manufacturing process variables rather than biological starting material variables, with ManufacturingProcess32 consistently ranking as the single most important factor. The process variables dominate the top of the list (occupying 8 of the top 10 spots).
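As a quick check of the 8-of-10 claim, a small sketch that counts process versus biological variables among the top 10 Cubist predictors computed above:
top10 <- head(cubist_imp$importance[order(-cubist_imp$importance$Overall), , drop = FALSE], 10)
table(ifelse(grepl("^Manufacturing", rownames(top10)), "Process", "Biological"))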
When comparing the top 10 predictors from this tree-based model to those from the optimal linear and nonlinear models, there is strong overlap in the major process variables across all model types. However, the tree-based model identifies a slightly wider variety of predictors because it can exploit both interactions (via splits) and localized linear fits (via leaf models), whereas the linear model emphasizes global main effects and the nonlinear model emphasizes flexible thresholds.
library(rpart.plot)
rpart.plot(cartModel$finalModel,
           type = 4,
           extra = 101,
           main = "Optimal Single Regression Tree (Yield)")
Yes. Visualizing the optimal single regression tree provides important "operational" insight that variable importance scores alone cannot reveal. While part (b) showed that process variables were the strongest predictors, the tree plot shows how these variables interact to determine yield, effectively giving us a concrete, interpretable recipe for high-yield production.
The tree immediately highlights ManufacturingProcess32 as the primary decision point. This variable acts as a "gatekeeper": when MP32 falls below ~160, the average yield drops sharply. This is a practical threshold that is not obvious from importance rankings alone. Below this top split, the next divisions involve ManufacturingProcess17 and ManufacturingProcess13, illustrating a clear hierarchy of process control. These top-level splits emphasize that stabilizing the manufacturing conditions is essential for achieving good performance.
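For readers who prefer the thresholds as text rather than a plot, a small sketch (assuming rpart.plot version 3.0 or later, which provides rpart.rules) prints the same splits as readable rules:
rpart.rules(cartModel$finalModel, roundint = FALSE)   # text version of the tree's split rules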
The biological variables (e.g., BiologicalMaterial03, BiologicalMaterial05) appear only deeper in the tree. This suggests that the biological inputs fine-tune the yield after the essential process conditions are within acceptable ranges. In other words, process variables determine the major yield structure, and biological variables adjust smaller variations within those process-defined regions.
Overall, the tree visualization reinforces the conclusion that process variables dominate the yield mechanism and provides specific operational thresholds that are directly interpretable for decision making in the manufacturing process.