# Load libraries
library(caret)
library(AppliedPredictiveModeling)
library(mlbench)
library(partykit)
library(Cubist)
library(randomForest)
library(party)
library(gbm)
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
Recreate the simulated data from Exercise 7.2.
# Set the seed
set.seed(200)
# Generate simulated data using Friedman1 function
simulated <- mlbench.friedman1(200, sd = 1)
# Combine predictors and response into a single data frame
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
# Rename the last column to "y" (response variable)
colnames(simulated)[ncol(simulated)] <- "y"
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No, the random forest does not make meaningful use of the uninformative predictors. V6 through V10 have importance scores near zero or negative, indicating they had little to no impact on the model, while V1, V2, and V4 contribute the most to the model's predictive power.
set.seed(200)
# Build the random forest
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
# Compute variable importance
rfImp1 <- varImp(model1, scale = FALSE)
# Display variable importance
rfImp1
## Overall
## V1 8.5406729
## V2 6.4511374
## V3 0.7470911
## V4 7.6128264
## V5 2.2684088
## V6 0.1190456
## V7 0.1021863
## V8 -0.1632359
## V9 -0.1296344
## V10 -0.0564990
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
With conditional inference trees on this data, the overall pattern is similar: V2 and V4 remain the most important predictors (slightly reduced), and the uninformative variables (V6–V10) again have near-zero or negative scores. The main change between the conditional and traditional measures involves V1 and its correlated copy duplicate1: both drop substantially under the conditional measure, with duplicate1 reduced the most. This reflects the conditional measure's adjustment for the explanatory power shared between V1 and duplicate1.
# Set seed
set.seed(200)
# Fit a random forest model using conditional inference trees
cforest_model <- cforest(y ~ .,
data = simulated,
control = cforest_control(
ntree = 1000,
mtry = 5))
# Calculate traditional variable importance
varimp_traditional <- varimp(cforest_model, conditional = FALSE)
# Calculate conditional variable importance
varimp_conditional <- varimp(cforest_model, conditional = TRUE)
# Print variable importance
print("Traditional Variable Importance:")
varimp_traditional
print("Conditional Variable Importance:")
varimp_conditional
## [1] "Traditional Variable Importance:"
## V1 V2 V3 V4 V5 V6
## 6.708312002 6.466771844 0.038456900 8.307774215 2.097853877 0.029700732
## V7 V8 V9 V10 duplicate1
## 0.069557380 -0.018926189 -0.035472347 -0.004772707 2.854234955
## [1] "Conditional Variable Importance:"
## V1 V2 V3 V4 V5 V6
## 1.541583201 3.850850812 0.018334131 5.179760111 0.799189889 0.005908606
## V7 V8 V9 V10 duplicate1
## 0.020205315 -0.012054079 -0.003961872 -0.006614206 0.864530427
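Placing the two measures side by side makes the reduction of V1 and duplicate1 under the conditional scheme easy to see:
# Compare traditional and conditional importance side by side
round(cbind(Traditional = varimp_traditional,
            Conditional = varimp_conditional), 3)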
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
The boosted model also prioritizes V4, V2, and V1, which account for the majority of the predictive power; this aligns with the random forest. The Cubist importances shown below, however, are not informative: varImp lists y itself at 50 and every real predictor at 0, which indicates the response leaked into the predictor matrix. The call drops the last column of simulated, but once duplicate1 is appended the last column is no longer y.
set.seed(200)
# Fit a boosted tree model
boosted_model <- gbm(formula = y ~ .,
data = simulated,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01,
n.minobsinnode = 10)
# Calculate variable importance for the boosted tree
boosted_importance <- summary(boosted_model, plotit = FALSE)
boosted_importance
## var rel.inf
## V4 V4 28.2607622
## V2 V2 21.6725509
## V1 V1 20.2846806
## V5 V5 11.7094661
## V3 V3 7.4101097
## duplicate1 duplicate1 7.1868637
## V7 V7 1.0467638
## V6 V6 0.9816841
## V9 V9 0.5604345
## V8 V8 0.4865124
## V10 V10 0.4001722
# Fit a Cubist model
# Note: dropping the last column removes duplicate1, not the response,
# so y remains in the predictor matrix here (see the corrected call below)
cubist_model <- cubist(
x = simulated[, -ncol(simulated)],
y = simulated$y)
# Calculate variable importance for the Cubist model
cubist_importance <- varImp(cubist_model)
cubist_importance
## Overall
## y 50
## V1 0
## V2 0
## V3 0
## V4 0
## V5 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
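To obtain meaningful Cubist importances, the response should be excluded by name rather than by position. A minimal corrected call (results not rerun here):
# Drop the response by name so y stays out of the predictor matrix
cubist_model2 <- cubist(x = simulated[, setdiff(names(simulated), "y")],
                        y = simulated$y)
varImp(cubist_model2)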
Use a simulation to show tree bias with different granularities.
We simulate three predictors with distinct granularities: a continuous predictor (high granularity), a three-level factor (low granularity), and a 50-level factor (intermediate granularity); only the continuous predictor is related to the response by construction.
In the random forest, the high-granularity predictor receives essentially all of the importance, while the low- and intermediate-granularity predictors score near zero or negative, indicating limited influence.
# Simulate data
set.seed(200)
n <- 500
# Predictor with high continuous granularity
high_gran <- rnorm(n)
# Predictor with low granularity
low_gran <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
# Predictor with intermediate granularity
mid_gran <- factor(sample(1:50, n, replace = TRUE))
# Response variable
y <- 2 * high_gran + rnorm(n)
# Combine data
simulated_data <- data.frame(y = y, high_gran = high_gran, low_gran = low_gran, mid_gran = mid_gran)
# Fit a random forest
rf_model <- randomForest(y ~ .,
data = simulated_data,
importance = TRUE,
ntree = 500)
# Calculate variable importance
rf_importance <- varImp(rf_model, scale = FALSE)
# Display variable importance
print(rf_importance)
## Overall
## high_gran 4.230081810
## low_gran -0.006737338
## mid_gran -0.094722376
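As a complementary check of the bias itself, both predictors can be made pure noise; a single regression tree should still tend to split on the higher-granularity one because it offers many more candidate split points. A sketch with hypothetical variable names:
library(rpart)
set.seed(200)
# Count which uninformative predictor is chosen for the root split
first_split <- replicate(200, {
  dat <- data.frame(y = rnorm(100),                            # response is pure noise
                    noise_hi = rnorm(100),                      # ~100 distinct values
                    noise_lo = sample(1:2, 100, replace = TRUE))# 2 distinct values
  fit <- rpart(y ~ noise_hi + noise_lo, data = dat,
               control = rpart.control(maxdepth = 1, cp = 0,
                                       minsplit = 2, minbucket = 1))
  as.character(fit$frame$var[1])
})
table(first_split)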
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
With both the learning rate and bagging fraction set to 0.9, the model takes aggressive steps (learning rate) and sees most of the training data in every iteration (bagging fraction), so it quickly converges on the strongest predictors and concentrates importance in those few variables. With both set to 0.1, the steps are smaller and more cautious, and the smaller samples introduce more randomness, which lets the model discover and use relationships across a broader range of predictors. This is why the right-hand plot concentrates importance in a few top predictors while the left-hand plot distributes it more evenly across many predictors.
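This behavior can be checked directly on the solubility data; the sketch below refits gbm at the two extreme settings from Fig. 8.24 (n.trees and interaction.depth are assumptions, not the values behind the figure) and compares the relative-influence tables:
# Refit gbm at the two extreme shrinkage / bag.fraction settings
data(solubility)  # from AppliedPredictiveModeling: solTrainXtrans, solTrainY
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)
set.seed(200)
gbm_spread <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 3,
                  shrinkage = 0.1, bag.fraction = 0.1)
set.seed(200)
gbm_focused <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 3,
                   shrinkage = 0.9, bag.fraction = 0.9)
# Per Fig. 8.24, influence should be spread out in the first table
# and concentrated in a few predictors in the second
head(summary(gbm_spread, plotit = FALSE), 10)
head(summary(gbm_focused, plotit = FALSE), 10)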
Which model do you think would be more predictive of other samples?
The model with the lower learning rate (0.1) and lower bagging fraction (0.1) should generalize better to new samples because it spreads importance across a broader range of predictors and so captures more of the variability in the data. The 0.9/0.9 model may fit the training set well, but by relying on a narrow set of dominant predictors it misses more nuanced relationships and is more likely to overfit.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Allowing deeper trees and more complex interactions would likely make both ends of the importance distribution more extreme. For the 0.9/0.9 model, deeper trees let the strongest predictors be used repeatedly in different interactions, further concentrating importance on a few dominant variables.
For the 0.1/0.1 model, the added depth would enable discovery of more nuanced predictor relationships, spreading importance across even more variables.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
I fit random forest, gradient boosted, Cubist, and conditional inference tree models. These are the overall test-set results:
Model | RMSE | R-squared | MAE |
---|---|---|---|
Random Forest | 0.9823121 | 0.6791878 | 0.7301441 |
Gradient Boosted | 1.1434369 | 0.5681299 | 0.9060823 |
Cubist | 0.9303243 | 0.7164876 | 0.7151199 |
Conditional Tree | 1.2657398 | 0.4618865 | 0.967534 |
Based on the performance metrics, the Cubist model performs best overall with the lowest RMSE (0.93), highest R-squared (0.72), and lowest MAE (0.72), followed by Random Forest, Gradient Boosting and then the Conditional Tree. The Cubist model likely performs best on the chemical manufacturing data because it combines rule-based partitioning with linear models. This combination may help it identify local patterns in chemical manufacturing process variations.
data("ChemicalManufacturingProcess")
# Separate predictors and outcome variables
predictors <- ChemicalManufacturingProcess[, !names(ChemicalManufacturingProcess) %in% "Yield"]
outcome <- ChemicalManufacturingProcess$Yield
# Create KNN imputation
imputation_model <- preProcess(predictors, method = "knnImpute")
# Apply imputation model
predictors_imputed <- predict(imputation_model, newdata = predictors)
# Recombine imputed predictor variables
data_imputed <- cbind(predictors_imputed, Yield = outcome)
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]
# Train random forest with cross-validation
control <- trainControl(method = "cv",
number = 10)
rf_model <- train(Yield ~ .,
data = train_data,
method = "rf",
trControl = control,
importance = TRUE)
# Make predictions
rf_predictions <- predict(rf_model, test_data)
# Get performance metrics
rf_performance <- postResample(pred = rf_predictions,
obs = test_data$Yield)
# Variable importance
rf_importance <- varImp(rf_model)
print(rf_performance)
## RMSE Rsquared MAE
## 0.9823121 0.6791878 0.7301441
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 50.23
## BiologicalMaterial03 37.92
## ManufacturingProcess13 26.27
## BiologicalMaterial11 25.00
## ManufacturingProcess31 23.02
## ManufacturingProcess06 22.71
## BiologicalMaterial04 22.29
## ManufacturingProcess09 21.90
## BiologicalMaterial06 20.64
## BiologicalMaterial12 19.46
## ManufacturingProcess33 19.25
## ManufacturingProcess30 18.86
## ManufacturingProcess11 18.40
## BiologicalMaterial02 17.24
## BiologicalMaterial05 16.49
## ManufacturingProcess01 16.47
## ManufacturingProcess28 16.15
## ManufacturingProcess29 16.04
## BiologicalMaterial01 15.92
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]
# Create tuning grid for gbm parameters
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
# Train gbm with cross-validation
control <- trainControl(method = "cv",
number = 10)
gbm_model <- train(Yield ~ .,
data = train_data,
method = "gbm",
trControl = control,
tuneGrid = gbmGrid,
verbose = FALSE)
# Make predictions
gbm_predictions <- predict(gbm_model, test_data)
# Get performance metrics
gbm_performance <- postResample(pred = gbm_predictions,
obs = test_data$Yield)
# Variable importance
gbm_importance <- varImp(gbm_model)
print(gbm_performance)
## RMSE Rsquared MAE
## 1.1434369 0.5681299 0.9060823
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 31.557
## ManufacturingProcess09 29.098
## ManufacturingProcess06 25.313
## ManufacturingProcess13 23.507
## BiologicalMaterial03 23.187
## ManufacturingProcess31 19.058
## BiologicalMaterial06 16.772
## BiologicalMaterial12 13.498
## ManufacturingProcess05 11.836
## BiologicalMaterial09 11.445
## ManufacturingProcess04 9.943
## ManufacturingProcess01 9.851
## ManufacturingProcess37 8.882
## BiologicalMaterial01 8.629
## BiologicalMaterial10 7.670
## BiologicalMaterial05 7.464
## ManufacturingProcess45 7.211
## BiologicalMaterial11 7.066
## ManufacturingProcess26 6.819
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]
# Create tuning grid for Cubist parameters
cubistGrid <- expand.grid(committees = seq(1, 100, by = 10),
neighbors = c(0, 5, 9))
# Train Cubist with cross-validation
control <- trainControl(method = "cv",
number = 10)
cubist_model <- train(Yield ~ .,
data = train_data,
method = "cubist",
trControl = control,
tuneGrid = cubistGrid)
# Make predictions
cubist_predictions <- predict(cubist_model, test_data)
# Get performance metrics
cubist_performance <- postResample(pred = cubist_predictions,
obs = test_data$Yield)
# Variable importance
cubist_importance <- varImp(cubist_model)
print(cubist_performance)
## RMSE Rsquared MAE
## 0.9819949 0.6860692 0.7431321
## cubist variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess17 100.00
## ManufacturingProcess32 95.56
## ManufacturingProcess09 56.67
## BiologicalMaterial02 47.78
## ManufacturingProcess39 38.89
## ManufacturingProcess28 37.78
## BiologicalMaterial06 36.67
## ManufacturingProcess06 34.44
## ManufacturingProcess33 27.78
## BiologicalMaterial03 27.78
## ManufacturingProcess13 24.44
## ManufacturingProcess29 23.33
## ManufacturingProcess45 22.22
## ManufacturingProcess04 18.89
## BiologicalMaterial01 16.67
## ManufacturingProcess26 15.56
## BiologicalMaterial09 14.44
## BiologicalMaterial08 13.33
## ManufacturingProcess31 11.11
## BiologicalMaterial10 11.11
# Create train/test split
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.7, list = FALSE)
train_data <- data_imputed[trainIndex, ]
test_data <- data_imputed[-trainIndex, ]
# Train conditional tree with cross-validation
control <- trainControl(method = "cv",
number = 10)
ctree_model <- train(Yield ~ .,
data = train_data,
method = "ctree",
trControl = control)
# Make predictions
ctree_predictions <- predict(ctree_model, test_data)
# Get performance metrics
ctree_performance <- postResample(pred = ctree_predictions,
obs = test_data$Yield)
# Variable importance
ctree_importance <- varImp(ctree_model)
# Print performance metrics
print(ctree_performance)
## RMSE Rsquared MAE
## 1.2657398 0.4618865 0.9675345
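The comparison table at the start of this exercise can be assembled directly from the four postResample results:
# Collect the test-set metrics from all four models into one table
results <- rbind(`Random Forest` = rf_performance,
                 `Gradient Boosted` = gbm_performance,
                 Cubist = cubist_performance,
                 `Conditional Tree` = ctree_performance)
knitr::kable(results, digits = 3)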
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Manufacturing process variables dominate the top predictors, with ManufacturingProcess17, 32, 09, 06, and 28 having the highest importance scores. Biological materials appear in the top 10 predictors, but they have lower importance scores, suggesting the manufacturing process parameters have a stronger influence on the yield variable.
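A quick tally of variable types among the Cubist model's top 10 (using the cubist_importance object computed above) supports this:
# Count biological vs. manufacturing process variables in the Cubist top 10
imp <- cubist_importance$importance
top10 <- rownames(imp)[order(-imp$Overall)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))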
My previous optimal nonlinear model was MARS (R-squared = 0.49) and the optimal linear model was Lasso (R-squared = 0.56). Both Cubist and MARS emphasize manufacturing process variables (particularly ManufacturingProcess17, 25, 26, 29, and 32). The Lasso model gave more mixed importance across biological and manufacturing variables, including BiologicalMaterial06 and 03 among its top predictors. ManufacturingProcess32 is important across all three models, suggesting it is a particularly influential variable. While the models disagree on the importance of the biological variables, manufacturing process variables contribute more to importance overall than the biological ones.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Cubist is a rule-based committee model rather than a single tree, so it cannot be plotted as one; instead, I plotted the conditional inference tree.
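A minimal sketch of the plotting call for that view, assuming the ctree_model fit above (party's default panel shows the distribution of Yield in each terminal node):
# Plot the final conditional inference tree with Yield distributions
# in the terminal nodes
plot(ctree_model$finalModel)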
The tree shows that manufacturing process variables drive the yield: ManufacturingProcess32 plays the dominant role, with a secondary refinement provided by ManufacturingProcess09.