# Load libraries
library(caret)
library(AppliedPredictiveModeling)
library(mlbench)
library(partykit)
library(Cubist)
library(randomForest)
library(party)
library(gbm)

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.


Exercise 8.1


Simulated Data

Recreate the simulated data from Exercise 7.2.

# Set the seed
set.seed(200)

# Generate simulated data using Friedman1 function
simulated <- mlbench.friedman1(200, sd = 1)

# Combine predictors and response into a single data frame
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)

# Rename the last column to "y" (response variable)
colnames(simulated)[ncol(simulated)] <- "y"


Random Forest

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The random forest does not make significant use of the uninformative predictors: V6–V10 have importance scores near zero or negative, indicating they had little to no impact on the model. V1, V2, and V4 contribute the most to the model's predictive power.

set.seed(200)

# Build the random forest
model1 <- randomForest(y ~ ., 
                       data = simulated, 
                       importance = TRUE, 
                       ntree = 1000)

# Compute variable importance
rfImp1 <- varImp(model1, scale = FALSE)

# Display variable importance
rfImp1
##        Overall
## V1   8.5406729
## V2   6.4511374
## V3   0.7470911
## V4   7.6128264
## V5   2.2684088
## V6   0.1190456
## V7   0.1021863
## V8  -0.1632359
## V9  -0.1296344
## V10 -0.0564990


Correlated Predictor

Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

When a predictor highly correlated with V1 is added, the importance score for V1 drops. V1 and duplicate1 compete for the same splits, and because the random forest chooses between them more or less arbitrarily from tree to tree, their individual importance scores are diluted. The predictive power is distributed across the correlated features, so neither may appear highly important on its own, which can mask a truly important predictor. Adding a second predictor that is also highly correlated with V1 would dilute the scores further.

set.seed(200)

# Add highly correlated predictor with V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

# Calculate the correlation
correlation <- cor(simulated$duplicate1, simulated$V1)

# Print the correlation
print(correlation)
## [1] 0.9497025
# Build random forest model with highly correlated predictor
model2 <- randomForest(y ~ ., 
                       data = simulated, 
                       importance = TRUE, 
                       ntree = 1000)

# Compute variable importance
rfImp2 <- varImp(model2, scale = FALSE)

# Display the variable importance
rfImp2
##                 Overall
## V1          6.062482986
## V2          6.090608279
## V3          0.643432061
## V4          6.929862226
## V5          2.141978654
## V6          0.207868621
## V7         -0.026533024
## V8         -0.094702352
## V9          0.003902854
## V10        -0.079523621
## duplicate1  4.084363474


Conditional Tree

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

With conditional inference trees on these data, the overall pattern is similar: V2 and V4 remain the most important (though their scores are slightly reduced), and the uninformative variables (V6–V10) have near-zero or negative scores and are effectively ignored. The main difference between the conditional and traditional importances concerns V1 and its correlated copy duplicate1: both are reduced substantially, with duplicate1 lowered more. The reduction reflects the conditional measure's adjustment for the explanatory power shared between V1 and duplicate1.

# Set seed
set.seed(200)

# Fit a random forest model using conditional inference trees
cforest_model <- cforest(y ~ .,
                         data = simulated,
                         control = cforest_control(
                           ntree = 1000,
                           mtry = 5))

# Calculate traditional variable importance
varimp_traditional <- varimp(cforest_model, conditional = FALSE)

# Calculate conditional variable importance
varimp_conditional <- varimp(cforest_model, conditional = TRUE)

# Print variable importance
print("Traditional Variable Importance:")
## [1] "Traditional Variable Importance:"
print(varimp_traditional)
##           V1           V2           V3           V4           V5           V6 
##  6.708312002  6.466771844  0.038456900  8.307774215  2.097853877  0.029700732 
##           V7           V8           V9          V10   duplicate1 
##  0.069557380 -0.018926189 -0.035472347 -0.004772707  2.854234955
print("Conditional Variable Importance:")
## [1] "Conditional Variable Importance:"
print(varimp_conditional)
##           V1           V2           V3           V4           V5           V6 
##  1.541583201  3.850850812  0.018334131  5.179760111  0.799189889  0.005908606 
##           V7           V8           V9          V10   duplicate1 
##  0.020205315 -0.012054079 -0.003961872 -0.006614206  0.864530427


Boosted & Cubist

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

The boosted model also prioritizes V4, V2, and V1, which account for the majority of the relative influence; this aligns with the random forest results. The Cubist importances below, however, are not meaningful: the predictor matrix was built with simulated[, -ncol(simulated)], which drops the last column (duplicate1) rather than the response, so y leaked into the predictors and dominates the model. A corrected fit is sketched after the output below.

set.seed(200)

# Fit a boosted tree model
boosted_model <- gbm(formula = y ~ .,
                     data = simulated,
                     distribution = "gaussian",
                     n.trees = 1000,
                     interaction.depth = 3,
                     shrinkage = 0.01,
                     n.minobsinnode = 10)

# Calculate variable importance for the boosted tree
boosted_importance <- summary(boosted_model, plotit = FALSE)
boosted_importance
##                   var    rel.inf
## V4                 V4 28.2607622
## V2                 V2 21.6725509
## V1                 V1 20.2846806
## V5                 V5 11.7094661
## V3                 V3  7.4101097
## duplicate1 duplicate1  7.1868637
## V7                 V7  1.0467638
## V6                 V6  0.9816841
## V9                 V9  0.5604345
## V8                 V8  0.4865124
## V10               V10  0.4001722
# Fit a Cubist model
cubist_model <- cubist(
  x = simulated[, -ncol(simulated)],
  y = simulated$y)

# Calculate variable importance for the Cubist model
cubist_importance <- varImp(cubist_model)
cubist_importance
##     Overall
## y        50
## V1        0
## V2        0
## V3        0
## V4        0
## V5        0
## V6        0
## V7        0
## V8        0
## V9        0
## V10       0
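For completeness, a corrected Cubist fit is sketched below (my addition, not part of the original output): the response is dropped by name so that duplicate1 stays in the predictor set. With this fix the Cubist usage statistics would be expected to follow the same pattern as the other models, emphasizing V1, V2, and V4.

# Hedged sketch: refit Cubist with the response excluded by name rather than by
# position (the original call dropped duplicate1 instead of y)
cubist_model_fixed <- cubist(
  x = simulated[, setdiff(names(simulated), "y")],
  y = simulated$y)

# Usage-based importance for the corrected fit
varImp(cubist_model_fixed)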


Exercise 8.2

Use a simulation to show tree bias with different granularities.

We simulate predictors at three distinct granularities:

  • Predictor high_gran is drawn from a standard normal distribution, giving a high-granularity, continuous predictor.
  • Predictor low_gran is a low-granularity categorical variable with only 3 distinct levels.
  • Predictor mid_gran is an intermediate-granularity categorical variable with 50 distinct levels, providing more variation than low_gran but less than high_gran.

In the random forest, the high-granularity predictor is by far the most influential, while the low- and intermediate-granularity predictors contribute minimally or negatively, indicating their limited importance. Since only high_gran actually drives the response here, a cleaner check of the granularity bias itself, using predictors that are all pure noise, is sketched after the importance output below.

# Simulate data
set.seed(200)
n <- 500

# Predictor with high continuous granularity
high_gran <- rnorm(n)

# Predictor with low granularity
low_gran <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Predictor with intermediate granularity
mid_gran <- factor(sample(1:50, n, replace = TRUE))

# Response variable
y <- 2 * high_gran + rnorm(n)

# Combine data
simulated_data <- data.frame(y = y, high_gran = high_gran, low_gran = low_gran, mid_gran = mid_gran)

# Fit a random forest
rf_model <- randomForest(y ~ ., 
                         data = simulated_data, 
                         importance = TRUE, 
                         ntree = 500)

# Calculate variable importance
rf_importance <- varImp(rf_model, scale = FALSE)

# Display variable importance
print(rf_importance)
##                Overall
## high_gran  4.230081810
## low_gran  -0.006737338
## mid_gran  -0.094722376
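To isolate the granularity effect from informativeness, here is a minimal follow-up sketch (my addition, assuming the rpart package is available; it is not loaded above). Both predictors are pure noise, so any preference between them reflects selection bias: the continuous predictor offers many more candidate split points, and a CART-style tree usually splits on it first.

# Hedged sketch: a single regression tree on two uninformative predictors
# that differ only in granularity
library(rpart)

set.seed(200)
noise_df <- data.frame(
  noise_y    = rnorm(500),                                            # response: pure noise
  noise_cont = rnorm(500),                                            # high-granularity noise
  noise_cat  = factor(sample(c("A", "B", "C"), 500, replace = TRUE))) # 3-level noise

# Grow a one-split tree; cp = 0 lets it make whatever split improves fit most
bias_tree <- rpart(noise_y ~ noise_cont + noise_cat, data = noise_df,
                   control = rpart.control(maxdepth = 1, cp = 0))

# Which variable did the root node split on? Typically noise_cont.
bias_tree$frame$var[1]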


Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:


Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

With a high learning rate (0.9) and high bagging fraction (0.9), each iteration takes an aggressive step along the gradient and each tree sees most of the training data, so the model quickly converges on the strongest predictors and concentrates importance in just those few variables. With a low learning rate (0.1) and low bagging fraction (0.1), the steps are smaller and each tree is grown on a different small random sample, which adds variability and lets the model pick up relationships across a broader range of predictors. This is why the right-hand plot concentrates importance in a few top predictors while the left-hand plot spreads importance more evenly across many.
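Figure 8.24 is based on the solubility data, so the contrast can be reproduced roughly with the sketch below (my own illustration; the solubility objects solTrainXtrans and solTrainY come from AppliedPredictiveModeling, and the tree count and interaction depth are placeholder values, not the book's settings).

# Hedged sketch: refit boosting under the two extreme settings from Fig. 8.24
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(200)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)

set.seed(200)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)

# Compare how quickly relative influence drops off: the 0.9/0.9 fit typically
# concentrates rel.inf in the first few predictors
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)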


Which model do you think would be more predictive of other samples?

The model with the lower learning rate (0.1) and lower bagging fraction (0.1) should generalize better to new samples: it spreads importance across a broader range of predictors and so captures more of the variability we would expect in new data. The high learning rate / high bagging fraction model may fit the training set well, but it is more prone to overfitting because it relies on a narrow set of dominant predictors and misses the more nuanced relationships among the others.
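A quick resampling check of that claim could look like the following, continuing the sketch above (hedged: caret's gbm method does not tune bag.fraction, so it is passed straight through to gbm, and the single-row tuning grids are my own placeholder values).

# Hedged sketch: cross-validated RMSE for the 0.1/0.1 versus 0.9/0.9 settings
set.seed(200)
cv_low  <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(n.trees = 100, interaction.depth = 1,
                                       shrinkage = 0.1, n.minobsinnode = 10),
                 bag.fraction = 0.1, verbose = FALSE)

set.seed(200)
cv_high <- train(x = solTrainXtrans, y = solTrainY, method = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = data.frame(n.trees = 100, interaction.depth = 1,
                                       shrinkage = 0.9, n.minobsinnode = 10),
                 bag.fraction = 0.9, verbose = FALSE)

# Lower resampled RMSE suggests better expected performance on new samples
c(low = cv_low$results$RMSE, high = cv_high$results$RMSE)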


How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Allowing deeper trees and more complex interactions would likely make the importance distribution more extreme at both ends of this range. For the high learning rate / high bagging fraction model (0.9/0.9), deeper trees let the strongest predictors be reused in many different interactions, further concentrating importance on a few dominant predictors.

For the low-settings model (0.1/0.1), the added complexity and tree depth would enable discovery of more nuanced predictor relationships, spreading the importance across more variables.
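To see this directly, the 0.9/0.9 sketch above can be refit with a much deeper interaction depth and the drop-off of relative influence compared (again an illustrative assumption, not the configuration behind Fig. 8.24).

# Hedged sketch: same 0.9/0.9 settings as above but with deeper trees; compare
# how steeply rel.inf falls off relative to the shallow fit
set.seed(200)
gbm_high_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                     n.trees = 100, interaction.depth = 10,
                     shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_high_deep, plotit = FALSE), 10)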


Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:


Which tree-based regression model gives the optimal resampling and test set performance?

I modeled random forest, gradient boosted, Cubist, and conditional inference tree models. These are the overall test-set results:

Model               RMSE        R-squared   MAE
Random Forest       0.9823121   0.6791878   0.7301441
Gradient Boosted    1.1434369   0.5681299   0.9060823
Cubist              0.9819949   0.6860692   0.7431321
Conditional Tree    1.2657398   0.4618865   0.9675345

Based on these metrics, the Cubist model performs best overall, with the lowest RMSE (0.982, marginally below the random forest) and the highest R-squared (0.686); its MAE (0.743) is slightly higher than the random forest's (0.730). The random forest is a close second, followed by gradient boosting and then the conditional tree. The Cubist model likely performs well on the chemical manufacturing data because it combines rule-based partitioning with linear models, which may help it capture local patterns in the manufacturing process variables.
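For reference, the comparison table above can be assembled from the postResample objects created in the model chunks that follow (a small sketch using those object names; it would be run after all four models below are fit).

# Hedged sketch: collect the test-set metrics from the chunks below into one table
results <- rbind("Random Forest"    = rf_performance,
                 "Gradient Boosted" = gbm_performance,
                 "Cubist"           = cubist_performance,
                 "Conditional Tree" = ctree_performance)
round(results, 4)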


Load Data
data("ChemicalManufacturingProcess")

# Separate predictors and outcome variables
predictors <- ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "Yield"]
outcome <- ChemicalManufacturingProcess$Yield

# Create KNN imputation
imputation_model <- preProcess(predictors, method = "knnImpute")

# Apply imputation model
predictors_imputed <- predict(imputation_model, newdata = predictors)

# Recombine imputed predictor variables
data_imputed <- cbind(predictors_imputed, Yield = outcome)


Random Forest
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]

# Train random forest with cross-validation
control <- trainControl(method = "cv", 
                       number = 10)

rf_model <- train(Yield ~ ., 
                 data = train_data,
                 method = "rf",
                 trControl = control,
                 importance = TRUE)

# Make predictions
rf_predictions <- predict(rf_model, test_data)

# Get performance metrics
rf_performance <- postResample(pred = rf_predictions, 
                             obs = test_data$Yield)

# Variable importance
rf_importance <- varImp(rf_model)

print(rf_performance)
##      RMSE  Rsquared       MAE 
## 0.9823121 0.6791878 0.7301441
print(rf_importance)
## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess17   50.23
## BiologicalMaterial03     37.92
## ManufacturingProcess13   26.27
## BiologicalMaterial11     25.00
## ManufacturingProcess31   23.02
## ManufacturingProcess06   22.71
## BiologicalMaterial04     22.29
## ManufacturingProcess09   21.90
## BiologicalMaterial06     20.64
## BiologicalMaterial12     19.46
## ManufacturingProcess33   19.25
## ManufacturingProcess30   18.86
## ManufacturingProcess11   18.40
## BiologicalMaterial02     17.24
## BiologicalMaterial05     16.49
## ManufacturingProcess01   16.47
## ManufacturingProcess28   16.15
## ManufacturingProcess29   16.04
## BiologicalMaterial01     15.92


Gradient Boosted
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]

# Create tuning grid for gbm parameters
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                      n.trees = seq(100, 1000, by = 50),
                      shrinkage = c(0.01, 0.1),
                      n.minobsinnode = 10)

# Train gbm with cross-validation
control <- trainControl(method = "cv", 
                       number = 10)

gbm_model <- train(Yield ~ ., 
                  data = train_data,
                  method = "gbm",
                  trControl = control,
                  tuneGrid = gbmGrid,
                  verbose = FALSE)

# Make predictions
gbm_predictions <- predict(gbm_model, test_data)

# Get performance metrics
gbm_performance <- postResample(pred = gbm_predictions, 
                              obs = test_data$Yield)

# Variable importance
gbm_importance <- varImp(gbm_model)

print(gbm_performance)
##      RMSE  Rsquared       MAE 
## 1.1434369 0.5681299 0.9060823
print(gbm_importance)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  31.557
## ManufacturingProcess09  29.098
## ManufacturingProcess06  25.313
## ManufacturingProcess13  23.507
## BiologicalMaterial03    23.187
## ManufacturingProcess31  19.058
## BiologicalMaterial06    16.772
## BiologicalMaterial12    13.498
## ManufacturingProcess05  11.836
## BiologicalMaterial09    11.445
## ManufacturingProcess04   9.943
## ManufacturingProcess01   9.851
## ManufacturingProcess37   8.882
## BiologicalMaterial01     8.629
## BiologicalMaterial10     7.670
## BiologicalMaterial05     7.464
## ManufacturingProcess45   7.211
## BiologicalMaterial11     7.066
## ManufacturingProcess26   6.819


Cubist
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]

# Create tuning grid for Cubist parameters
cubistGrid <- expand.grid(committees = seq(1, 100, by = 10),
                         neighbors = c(0, 5, 9))

# Train Cubist with cross-validation
control <- trainControl(method = "cv", 
                       number = 10)

cubist_model <- train(Yield ~ ., 
                      data = train_data,
                      method = "cubist",
                      trControl = control,
                      tuneGrid = cubistGrid)

# Make predictions
cubist_predictions <- predict(cubist_model, test_data)

# Get performance metrics
cubist_performance <- postResample(pred = cubist_predictions, 
                                 obs = test_data$Yield)

# Variable importance
cubist_importance <- varImp(cubist_model)

print(cubist_performance)
##      RMSE  Rsquared       MAE 
## 0.9819949 0.6860692 0.7431321
print(cubist_importance)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess17  100.00
## ManufacturingProcess32   95.56
## ManufacturingProcess09   56.67
## BiologicalMaterial02     47.78
## ManufacturingProcess39   38.89
## ManufacturingProcess28   37.78
## BiologicalMaterial06     36.67
## ManufacturingProcess06   34.44
## ManufacturingProcess33   27.78
## BiologicalMaterial03     27.78
## ManufacturingProcess13   24.44
## ManufacturingProcess29   23.33
## ManufacturingProcess45   22.22
## ManufacturingProcess04   18.89
## BiologicalMaterial01     16.67
## ManufacturingProcess26   15.56
## BiologicalMaterial09     14.44
## BiologicalMaterial08     13.33
## ManufacturingProcess31   11.11
## BiologicalMaterial10     11.11


Conditional Tree
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]

# Train conditional tree with cross-validation
control <- trainControl(method = "cv", 
                      number = 10)

ctree_model <- train(Yield ~ ., 
                   data = train_data,
                   method = "ctree",
                   trControl = control)

# Make predictions
ctree_predictions <- predict(ctree_model, test_data)

# Get performance metrics
ctree_performance <- postResample(pred = ctree_predictions, 
                               obs = test_data$Yield)

# Variable importance
ctree_importance <- varImp(ctree_model)

# Print performance metrics
print(ctree_performance)
##      RMSE  Rsquared       MAE 
## 1.2657398 0.4618865 0.9675345


Top Predictors

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Manufacturing process variables dominate the list: ManufacturingProcess17, 32, and 09 have the highest importance scores in the Cubist model, with ManufacturingProcess39, 28, and 06 also ranking high. Biological materials (BiologicalMaterial02, 06, and 03) appear in the top 10 but with lower scores, suggesting the manufacturing process parameters have a stronger influence on yield.

My previous optimal nonlinear model was MARS (R-squared = 0.49) and the optimal linear model was lasso (R-squared = 0.56). Both Cubist and MARS emphasize manufacturing process variables (particularly ManufacturingProcess17, 25, 26, 29, and 32). The lasso model split importance more evenly between biological and manufacturing variables, including BiologicalMaterial06 and 03 among its top predictors. ManufacturingProcess32 is important across all three models, suggesting it is a particularly influential variable. However, while the models disagree on the importance of the biological variables, manufacturing process variables contribute more to importance overall than the biological ones.
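For the comparison above, the Cubist top 10 can be pulled directly from the caret importance object (a small sketch; the MARS and lasso rankings come from the earlier assignments and are not recomputed here).

# Hedged sketch: top 10 predictors from the tuned Cubist model
cubist_top10 <- cubist_importance$importance
cubist_top10 <- cubist_top10[order(-cubist_top10$Overall), , drop = FALSE]
head(cubist_top10, 10)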


CT Plot

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Cubist is a rule-based model rather than a single tree, so it cannot be plotted as one tree. Instead, I plotted the conditional inference tree.

This tree shows that manufacturing process variables drive the splits: ManufacturingProcess32 plays the dominant role in determining yield, with a secondary refinement from ManufacturingProcess09.

# Get the final CTree model
final_ctree <- ctree_model$finalModel

# Plot
plot(final_ctree)