8.1. Recreate the simulated data from Exercise 7.2:
According to the importance scores below, V6-V10 appear unimportant in the random forest model.
# From the book; load the packages used throughout this exercise
library(mlbench)        # mlbench.friedman1()
library(randomForest)   # randomForest()
library(caret)          # varImp(), train()
library(party)          # cforest(), varimp()
library(Cubist)         # cubist()
set.seed(420)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"
model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)
rf_Imp
## Overall
## V1 6.29089217
## V2 7.23706200
## V3 1.52229636
## V4 9.93245576
## V5 0.82685129
## V6 0.11773907
## V7 -0.01531565
## V8 0.23933146
## V9 0.01159528
## V10 0.17604773
The importance of V1 decreased after adding another highly correlated variable; V1 is still in the top three, but V4 and V2 become the most important variables.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9377547
model_rf2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_imp2 <- varImp(model_rf2, scale = FALSE)
rf_imp2
## Overall
## V1 4.8648919382
## V2 7.3372450126
## V3 1.2728087767
## V4 9.8316668348
## V5 0.9354855143
## V6 0.1025747835
## V7 -0.0006555202
## V8 0.2455059911
## V9 -0.1369595734
## V10 0.0811743538
## duplicate1 3.3725652358
As we can see from the table below, conditional variable importance accounts for the correlation between V1 and duplicate1, whereas in the unconditional case V1 and duplicate1 show similar importance numbers. The conditional pattern does not look quite the same as that of the traditional random forest model.
model_cforest <- cforest(y ~ ., data = simulated)
cf_imp3 <- varimp(model_cforest) |> sort(decreasing = TRUE)                      # Without conditional
cf_imp4 <- varimp(model_cforest, conditional = TRUE) |> sort(decreasing = TRUE)  # With conditional
as.data.frame(cbind(Model2 = rf_imp2, Unconditional = cf_imp3, Conditional = cf_imp4))
## Overall Unconditional Conditional
## V1 4.8648919382 10.549970533 7.923372934
## V2 7.3372450126 7.335339707 5.273348585
## V3 1.2728087767 4.931780724 3.078124787
## V4 9.8316668348 2.422122190 0.966548301
## V5 0.9354855143 0.612462065 0.473301500
## V6 0.1025747835 0.189542140 0.079202570
## V7 -0.0006555202 0.110272311 0.053099001
## V8 0.2455059911 0.007043181 0.047259462
## V9 -0.1369595734 0.006203886 0.038465632
## V10 0.0811743538 0.005824392 0.002199627
## duplicate1 3.3725652358 -0.037999472 -0.011565963
Broadly, the pattern is the same: V6-V10 are still among the lowest in the importance list, and V4 remains one of the top predictors.
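As an additional check (not part of the original output), the three importance measures can be compared on rank order by aligning the sorted cforest scores back to the random forest row names; this is only a sketch using the objects created above.
# Hedged sketch: rank 1 = most important predictor under each measure
rank_comparison <- data.frame(
  RF            = rank(-rf_imp2$Overall),
  Unconditional = rank(-cf_imp3[rownames(rf_imp2)]),
  Conditional   = rank(-cf_imp4[rownames(rf_imp2)]),
  row.names     = rownames(rf_imp2)
)
rank_comparison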
gbmGrid = expand.grid(interaction.depth = seq(1, 5, by = 2),
                      n.trees = seq(100, 1000, by = 100),
                      shrinkage = 0.1,
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated,
                  tuneGrid = gbmGrid, verbose = FALSE,
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)
model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)
# Compare variable importance from the boosted tree and Cubist models
df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
## boosted cubist
## V1 50.155149 94.5
## V2 62.320221 78.0
## V3 33.478013 79.5
## V4 100.000000 60.5
## V5 19.006005 39.5
## V6 4.066961 3.5
## V7 1.815576 0.0
## V8 1.546456 0.0
## V9 0.000000 0.0
## V10 7.685275 0.0
## duplicate1 12.152952 0.0
8.2. Use a simulation to show tree bias with different granularities.
This simulation shows why it is important to be aware of granularity bias when interpreting variable importance from tree-based models. High-granularity variables, those with many distinct values or categories, can be over-selected by standard tree algorithms, while cforest, which uses conditional inference trees, helps reduce that effect. In the example below, x1 and x2, the true predictors, show the highest importance, while the high-granularity variable x3 (100 categories) shows little importance because the unbiased conditional-inference splitting reduces the selection bias.
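For contrast, here is a small, self-contained sketch (not part of the original solution) using a plain CART tree from rpart: with a pure-noise response, a biased splitter should still pick the high-granularity factor for the root split most of the time. The variable names and settings below are illustrative choices, not taken from the exercise.
library(rpart)
set.seed(321)
# For each replicate: simulate noise data with factors of 2, 10 and 50 levels,
# grow a one-split tree, and record which variable rpart chose at the root.
root_vars <- replicate(200, {
  n <- 200
  d <- data.frame(
    y   = rnorm(n),                                 # pure-noise response
    f2  = factor(sample(1:2,  n, replace = TRUE)),  # 2 levels
    f10 = factor(sample(1:10, n, replace = TRUE)),  # 10 levels
    f50 = factor(sample(1:50, n, replace = TRUE))   # 50 levels, high granularity
  )
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = 0, minbucket = 1))
  as.character(fit$frame$var[1])
})
table(root_vars)   # expect f50 to be chosen far more often than f2 or f10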
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:party':
##
## where
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(123)
# Simulation parameters
n <- 1000 # number of observations
n_sim <- 10 # number of simulations
#simulate data with different granularities
simulate_data <- function(n) {
  x1 <- rnorm(n)                          # Continuous, fine granularity
  x2 <- rnorm(n)                          # Continuous, fine granularity
  # Variables with different granularities
  x3 <- sample(1:100, n, replace = TRUE)  # High granularity
  x4 <- sample(1:10, n, replace = TRUE)   # Medium
  x5 <- sample(1:5, n, replace = TRUE)    # Low
  x6 <- sample(1:2, n, replace = TRUE)    # Lowest
  # Noise variables
  x7 <- rnorm(n)
  x8 <- rnorm(n)
  # Only x1 and x2 are truly important
  y <- 2*x1 + 3*x2 + rnorm(n, 0, 0.5)
  data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8)
}
# Run multiple simulations to observe bias patterns
results_list <- list()
for (i in 1:n_sim) {
  # Simulate data
  sim_data <- simulate_data(n)
  # Convert the discrete variables to factors
  sim_data$x3 <- as.factor(sim_data$x3)
  sim_data$x4 <- as.factor(sim_data$x4)
  sim_data$x5 <- as.factor(sim_data$x5)
  sim_data$x6 <- as.factor(sim_data$x6)
  # Fit a conditional inference forest
  cf <- cforest(y ~ .,
                data = sim_data,
                controls = cforest_control(ntree = 100, mtry = 3))
  # Variable importance
  vi <- varimp(cf)
  results_list[[i]] <- data.frame(
    variable = names(vi),
    importance = as.numeric(vi),
    simulation = i
  )
}
# Combine results
all_results <- bind_rows(results_list)
# Calculate average importance across simulations
avg_importance <- all_results |>
  group_by(variable) |>
  summarise(
    mean_imp = mean(importance),
    sd_imp = sd(importance),
    se_imp = sd_imp / sqrt(n_sim)
  ) |>
  arrange(desc(mean_imp))
print(avg_importance)
## # A tibble: 8 × 4
## variable mean_imp sd_imp se_imp
## <chr> <dbl> <dbl> <dbl>
## 1 x2 13.7 0.533 0.168
## 2 x1 5.69 0.290 0.0917
## 3 x3 0.112 0.0614 0.0194
## 4 x7 0.00400 0.0189 0.00599
## 5 x6 0.00334 0.0206 0.00652
## 6 x8 -0.00756 0.0155 0.00491
## 7 x4 -0.0118 0.0253 0.00800
## 8 x5 -0.0129 0.0245 0.00773
ggplot(avg_importance, aes(x = reorder(variable, mean_imp), y = mean_imp)) +
  geom_col(aes(fill = variable), alpha = 0.8) +
  geom_errorbar(aes(ymin = mean_imp - se_imp, ymax = mean_imp + se_imp),
                width = 0.2) +
  coord_flip() +
  labs(
    title = "Tree Algorithm Bias Towards High-Granularity Variables",
    subtitle = "cforest variable importance across multiple simulations",
    x = "Variables",
    y = "Variable Importance"
  ) +
  scale_fill_manual(values = c(
    "x1" = "darkgreen", "x2" = "darkgreen",           # True signals
    "x3" = "red",                                     # High granularity
    "x4" = "orange", "x5" = "yellow", "x6" = "blue",  # Other granularities
    "x7" = "gray", "x8" = "gray"                      # Noise
  )) +
  theme_minimal() +
  theme(legend.position = "none")
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The right-hand plot concentrates its importance on just the first few predictors because it uses a higher bagging fraction and learning rate: each tree is built on a larger fraction of the training data and takes larger steps toward the gradient, so the same strong predictors dominate. The left-hand model uses lower values, which makes it more likely to identify a larger set of predictors as important.
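To make this concrete, here is a hedged sketch fitting boosting at the two extremes on the solubility data that ships with AppliedPredictiveModeling; the 100 trees and depth of 3 are illustrative choices, not the book's exact setup.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)   # provides solTrainXtrans and solTrainY

sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low,  plotit = FALSE), 10)   # influence spread over more predictors
head(summary(gbm_high, plotit = FALSE), 10)   # influence concentrated on a few predictors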
The model on the left would likely be more predictive of other samples: its lower learning rate and bagging fraction make it more likely to include additional important predictors rather than over-relying on a few.
Increasing the interaction depth allows each tree to use more predictors, so the variable importance is more likely to be spread out over more predictors, and the RMSE is likely to be lower.
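Along the same lines, a short sketch (reusing sol_train from the sketch above; the depths of 1 and 7 are arbitrary illustrative values) that counts how many predictors receive any relative influence at a shallow versus deep interaction depth:
set.seed(100)
gbm_shallow <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, interaction.depth = 1)
set.seed(100)
gbm_deep    <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, interaction.depth = 7)

# Deeper trees use more predictors per tree, so more variables end up with
# non-zero relative influence
sum(summary(gbm_shallow, plotit = FALSE)$rel.inf > 0)
sum(summary(gbm_deep,    plotit = FALSE)$rel.inf > 0)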
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: (a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest regression model gives the best performance, with the lowest test set RMSE of about 0.55.
#load data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
set.seed(888)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)
index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)
train_chem <- full_data[index_chem,]
test_chem <- full_data[-index_chem,]
train_predictors <- train_chem[-c(1)]
test_predictors <- test_chem[-c(1)]
Random Forest: test set RMSE, Rsquared, and MAE of 0.5530519, 0.6320963, and 0.4230153, respectively.
set.seed(624)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntree = 1000)
rf_model
##
## Call:
## randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.3787238
## % Var explained: 63.42
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.5530519 0.6320963 0.4230153
Boosted Trees: test set RMSE, Rsquared, and MAE of 0.8701713, 0.5511551, and 0.7387145, respectively.
set.seed(624)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.0347 nan 0.0010 0.0006
## 2 1.0339 nan 0.0010 0.0007
## 3 1.0331 nan 0.0010 0.0006
## 4 1.0325 nan 0.0010 0.0005
## 5 1.0317 nan 0.0010 0.0007
## 6 1.0307 nan 0.0010 0.0007
## 7 1.0300 nan 0.0010 0.0007
## 8 1.0293 nan 0.0010 0.0007
## 9 1.0284 nan 0.0010 0.0005
## 10 1.0277 nan 0.0010 0.0006
## 20 1.0203 nan 0.0010 0.0003
## 40 1.0068 nan 0.0010 0.0008
## 60 0.9925 nan 0.0010 0.0007
## 80 0.9797 nan 0.0010 0.0006
## 100 0.9671 nan 0.0010 0.0005
gbmPred <- predict(gbm_model, newdata = test_predictors)
## Using 100 trees...
postResample(pred = gbmPred, obs = test_chem$Yield)
## RMSE Rsquared MAE
## 0.8701713 0.5511551 0.7387145
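The comparison above uses only the held-out test set. A hedged sketch for also comparing resampling performance with caret (assuming caret is loaded; the 10-fold cross-validation and small tuning grids are choices made here, not part of the original workflow):
set.seed(624)
ctrl <- trainControl(method = "cv", number = 10)

rf_cv  <- train(train_predictors, train_chem$Yield, method = "rf",
                trControl = ctrl, tuneLength = 3)
gbm_cv <- train(train_predictors, train_chem$Yield, method = "gbm",
                trControl = ctrl, tuneLength = 3, verbose = FALSE)

# Side-by-side resampling distributions of RMSE, R-squared and MAE
summary(resamples(list(RandomForest = rf_cv, BoostedTree = gbm_cv)))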
From the variable importance scores, ManufacturingProcess predictors dominate the list, with the remainder being BiologicalMaterial predictors. The outcome is the same as that from the non-linear model. The top 10 are ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11, ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28, and ManufacturingProcess11.
rf_var_imp <- varImp(rf_model)
head(rf_var_imp[order(rf_var_imp$Overall, decreasing = TRUE), , drop = FALSE], 10)
## Overall
## BiologicalMaterial01 4.859071
## BiologicalMaterial03 12.375324
## BiologicalMaterial05 6.642959
## BiologicalMaterial06 10.835926
## BiologicalMaterial08 4.937676
## BiologicalMaterial09 4.381674
## BiologicalMaterial10 4.177989
## BiologicalMaterial11 10.135233
## ManufacturingProcess01 4.251294
## ManufacturingProcess02 3.265441
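For a ranked graphical view of the same information (varImpPlot is part of the randomForest package; limiting the display to the 10 highest-ranked predictors is a choice made here):
# Plot the 10 predictors with the largest importance under both measures
varImpPlot(rf_model, n.var = 10, main = "Top 10 predictors (random forest)")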
The tree starts with the most important predictor, ManufacturingProcess32, at the top. This view does provide additional knowledge, as it shows the split points and how the predictors relate from the root node down. Seven of the ten terminal nodes are determined by splits on ManufacturingProcess variables and the remaining three by BiologicalMaterial variables, which suggests that the manufacturing process variables are strong predictors of the yield response.
library(rpart)
library(rpart.plot)
# Train a single regression tree
rpart_tree <- rpart(Yield ~ ., data = train_chem)
rpart.plot(rpart_tree)