8.1. Recreate the simulated data from Exercise 7.2:

  1. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

According to the importance scores below, V6-V10 appear to be unimportant in the random forest model.

#from book
set.seed(420)
simulated = mlbench.friedman1(200, sd = 1)
simulated = cbind(simulated$x, simulated$y)
simulated = as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] = "y"

model_rf = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rf_Imp = varImp(model_rf, scale = FALSE)

rf_Imp
##         Overall
## V1   6.29089217
## V2   7.23706200
## V3   1.52229636
## V4   9.93245576
## V5   0.82685129
## V6   0.11773907
## V7  -0.01531565
## V8   0.23933146
## V9   0.01159528
## V10  0.17604773
  2. Now add an additional predictor that is highly correlated with one of the informative predictors. For example: > simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1 > cor(simulated$duplicate1, simulated$V1). Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

The importance of V1 decreased after adding the highly correlated duplicate1 variable. V1 is still in the top three, but the top important variables are now V4 and V2. Adding a second predictor that is also highly correlated with V1 would be expected to dilute V1's importance further; a sketch of this is shown after the importance table below.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9377547
model_rf2 <- randomForest(y ~ ., data= simulated, importance= TRUE, ntree= 1000)
rf_imp2 <- varImp(model_rf2, scale=FALSE)
rf_imp2
##                  Overall
## V1          4.8648919382
## V2          7.3372450126
## V3          1.2728087767
## V4          9.8316668348
## V5          0.9354855143
## V6          0.1025747835
## V7         -0.0006555202
## V8          0.2455059911
## V9         -0.1369595734
## V10         0.0811743538
## duplicate1  3.3725652358
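To answer the second part of the question, a second predictor correlated with V1 can be added and the forest refit. A minimal sketch (the names duplicate2 and simulated2 are illustrative and not part of the original run):

# Work on a copy so the data frame used in the later parts is unchanged
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate2, simulated2$V1)

# Refit; with three correlated columns (V1, duplicate1, duplicate2) the
# importance credited to V1 is expected to be diluted even further
model_rf3 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
varImp(model_rf3, scale = FALSE)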
  3. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

As the table below shows, when variable importance is computed conditionally, the measure accounts for the correlation between V1 and duplicate1. Under the unconditional measure, V1 and duplicate1 show more similar importance values. The pattern does not appear to be the same as in the traditional random forest model.

model_cforest <- cforest(y ~ ., data= simulated)
cf_imp3<-varimp(model_cforest) |> sort(decreasing = TRUE) # Without conditional
cf_imp4<-varimp(model_cforest, conditional = TRUE) |> sort(decreasing=TRUE) # With conditional 
as.data.frame(cbind(Model2 = rf_imp2, Unconditional = cf_imp3, Conditional = cf_imp4))
##                  Overall Unconditional   Conditional
## V1          4.8648919382 10.549970533   7.923372934
## V2          7.3372450126  7.335339707   5.273348585
## V3          1.2728087767  4.931780724   3.078124787
## V4          9.8316668348  2.422122190   0.966548301
## V5          0.9354855143  0.612462065   0.473301500
## V6          0.1025747835  0.189542140   0.079202570
## V7         -0.0006555202  0.110272311   0.053099001
## V8          0.2455059911  0.007043181   0.047259462
## V9         -0.1369595734  0.006203886   0.038465632
## V10         0.0811743538  0.005824392   0.002199627
## duplicate1  3.3725652358 -0.037999472  -0.011565963
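Because cbind() combines by position and the two cforest importance vectors were sorted before binding, the rows of the table above may not line up with the rf_imp2 row names. A minimal sketch of a name-safe comparison, assuming the objects created above:

# Index the named cforest importance vectors by the random forest row names
# so that each row of the comparison refers to the same predictor
imp_compare <- data.frame(
  RF            = rf_imp2$Overall,
  Unconditional = cf_imp3[rownames(rf_imp2)],
  Conditional   = cf_imp4[rownames(rf_imp2)],
  row.names     = rownames(rf_imp2)
)
imp_compare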
  4. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Yes, the pattern is broadly the same. V6-V10 are still among the least important predictors, and V4 remains near the top of the list (it is the top predictor for the boosted model, while Cubist ranks V1 highest).

gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), 
                      n.trees = seq(100, 1000, by = 100), 
                      shrinkage = 0.1, 
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated, 
                  tuneGrid = gbmGrid, verbose = FALSE, 
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)

model_cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp<-varImp(model_cubist)
#Compare
df = as.data.frame(cbind(gbm_imp$importance, cubist_imp))
names(df) = c("boosted", "cubist")
df
##               boosted cubist
## V1          50.155149   94.5
## V2          62.320221   78.0
## V3          33.478013   79.5
## V4         100.000000   60.5
## V5          19.006005   39.5
## V6           4.066961    3.5
## V7           1.815576    0.0
## V8           1.546456    0.0
## V9           0.000000    0.0
## V10          7.685275    0.0
## duplicate1  12.152952    0.0

8.2. Use a simulation to show tree bias with different granularities.

This simulation shows why it is important to be aware of granularity bias when interpreting variable importance from tree-based models. High-granularity variables, i.e. those with many distinct values or categories, can be over-selected by tree algorithms, and cforest with conditional inference trees helps reduce that effect. In this example the true predictors x1 and x2 show the highest importance, while the high-granularity variable x3, a factor with 100 levels, shows little importance because the conditional inference framework reduces the selection bias. For contrast, a single CART tree fit is sketched after the plotting code below.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:party':
## 
##     where
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(123)

# Simulation parameters
n <- 1000  # number of observations
n_sim <- 10 # number of simulations

#simulate data with different granularities
simulate_data <- function(n) {
 
  x1 <- rnorm(n)  # Continuous, fine granularity
  x2 <- rnorm(n)  # Continuous, fine granularity

  # Variables with different granularities 
  x3 <- sample(1:100, n, replace = TRUE)  # High granularity 
  x4 <- sample(1:10, n, replace = TRUE)   # Medium 
  x5 <- sample(1:5, n, replace = TRUE)    # Low 
  x6 <- sample(1:2, n, replace = TRUE)    # lowest 
  
  # Noise variables
  x7 <- rnorm(n)
  x8 <- rnorm(n)
  
  #  only x1 and x2 are truly important
  y <- 2*x1 + 3*x2 + rnorm(n, 0, 0.5)
  
  data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8)
}

# Run multiple simulations to observe bias patterns
results_list <- list()

for(i in 1:n_sim) {
  # Simulate data
  sim_data <- simulate_data(n)
  
  # Convert high cardinality variables to factors
  sim_data$x3 <- as.factor(sim_data$x3)
  sim_data$x4 <- as.factor(sim_data$x4)
  sim_data$x5 <- as.factor(sim_data$x5)
  sim_data$x6 <- as.factor(sim_data$x6)
  
  # Fit a conditional inference forest (cforest)
  cf <- cforest(y ~ ., 
                data = sim_data, 
                controls = cforest_control(ntree = 100, mtry = 3))
  
  #  variable importance
  vi <- varimp(cf)
  
  results_list[[i]] <- data.frame(
    variable = names(vi),
    importance = as.numeric(vi),
    simulation = i
  )
}

# Combine  results
all_results <- bind_rows(results_list)

# Calculate average importance across simulations
avg_importance <- all_results |>
  group_by(variable) |>
  summarise(
    mean_imp = mean(importance),
    sd_imp = sd(importance),
    se_imp = sd_imp / sqrt(n_sim)
  ) |>
  arrange(desc(mean_imp))

print(avg_importance)
## # A tibble: 8 × 4
##   variable mean_imp sd_imp  se_imp
##   <chr>       <dbl>  <dbl>   <dbl>
## 1 x2       13.7     0.533  0.168  
## 2 x1        5.69    0.290  0.0917 
## 3 x3        0.112   0.0614 0.0194 
## 4 x7        0.00400 0.0189 0.00599
## 5 x6        0.00334 0.0206 0.00652
## 6 x8       -0.00756 0.0155 0.00491
## 7 x4       -0.0118  0.0253 0.00800
## 8 x5       -0.0129  0.0245 0.00773
ggplot(avg_importance, aes(x = reorder(variable, mean_imp), y = mean_imp)) +
  geom_col(aes(fill = variable), alpha = 0.8) +
  geom_errorbar(aes(ymin = mean_imp - se_imp, ymax = mean_imp + se_imp), 
                width = 0.2) +
  coord_flip() +
  labs(
    title = "Tree Algorithm Bias Towards High-Granularity Variables",
    subtitle = "cforest variable importance across multiple simulations",
    x = "Variables",
    y = "Variable Importance",
  ) +
  scale_fill_manual(values = c(
    "x1" = "darkgreen", "x2" = "darkgreen",  # True signals
    "x3" = "red",                            # High granularity bias
    "x4" = "orange", "x5" = "yellow", "x6" = "blue",  # Other granularities
    "x7" = "gray", "x8" = "gray"             # Noise
  )) +
  theme_minimal() +
  theme(legend.position = "none")
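For contrast, the sketch below (not part of the original run) fits a single CART tree to one simulated data set. Because an exhaustive split search favors predictors with many possible split points, the 100-level factor x3 would be expected to pick up inflated importance relative to the cforest results above:

library(rpart)

# One data set from the same generator, with the discrete variables as factors
bias_data <- simulate_data(n)
bias_data$x3 <- as.factor(bias_data$x3)
bias_data$x4 <- as.factor(bias_data$x4)
bias_data$x5 <- as.factor(bias_data$x5)
bias_data$x6 <- as.factor(bias_data$x6)

# Single CART tree; rpart stores an importance score for each predictor it uses
cart_fit <- rpart(y ~ ., data = bias_data)
cart_fit$variable.importance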

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  1. Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The right-hand model focuses its importance on the first few predictors because both the bagging fraction and the learning rate are high (0.9). Each tree is fit to almost all of the training data, and each tree's contribution to the ensemble is large, so the same strong predictors dominate the early trees and accumulate most of the importance. The left-hand model, with both parameters at 0.1, fits each tree to a small random subset of the data and takes much smaller steps, so later trees have more opportunity to use additional predictors and the importance is spread across more of them.
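A minimal sketch of how this could be checked on the solubility data; the solubility objects come from the AppliedPredictiveModeling package, and the number of trees and interaction depth below are illustrative assumptions, with only the shrinkage and bagging fraction taken from the question:

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY

sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# Both tuning parameters low (0.1): importance is expected to be spread out
gbm_low <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.1, bag.fraction = 0.1)

# Both tuning parameters high (0.9): importance is expected to concentrate
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)

# Compare the top of each relative-influence ranking
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)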

  2. Which model do you think would be more predictive of other samples?

The model on the left would likely be more predictive of other samples. Its lower learning rate and bagging fraction act as a form of regularization, spreading the model's reliance across more predictors and reducing the risk of overfitting to a few dominant ones.

  3. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows deeper trees that split on more predictors, so variable importance would be spread over a larger number of predictors and the slope of the importance plot would flatten for either model; the additional structure may also lower the RMSE.
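Continuing the sketch from part 1 (still an illustration, using the same hypothetical sol_train object), a deeper interaction depth can be tried to see the importance slope flatten:

# Deeper trees split on more predictors per tree, so the relative influence
# is expected to spread out (a flatter importance slope) for either setting
gbm_deep <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 15,
                shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_deep, plotit = FALSE), 10)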

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: (a) Which tree-based regression model gives the optimal resampling and test set performance?

The random forest model gives the best result, with the lowest test-set RMSE of about 0.55. A cross-validated resampling comparison is sketched after the test-set results below.

#load data
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

set.seed(888)

imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)

index_chem <- createDataPartition(full_data$Yield , p=.8, list=F)

train_chem <-  full_data[index_chem,] 
test_chem <- full_data[-index_chem,]

train_predictors <- train_chem[-c(1)]
test_predictors <-  test_chem[-c(1)]

Random forest: test-set RMSE = 0.5530519, R-squared = 0.6320963, MAE = 0.4230153.

set.seed(624)

# Note: randomForest's argument is spelled ntree; the misspelled ntrees is
# silently ignored, so the default of 500 trees was used (see summary below)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, ntrees = 1000)
rf_model
## 
## Call:
##  randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 0.3787238
##                     % Var explained: 63.42
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.5530519 0.6320963 0.4230153

Boosted trees (fit with gbm.fit default settings): test-set RMSE = 0.8701713, R-squared = 0.5511551, MAE = 0.7387145.

set.seed(624)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.0347             nan     0.0010    0.0006
##      2        1.0339             nan     0.0010    0.0007
##      3        1.0331             nan     0.0010    0.0006
##      4        1.0325             nan     0.0010    0.0005
##      5        1.0317             nan     0.0010    0.0007
##      6        1.0307             nan     0.0010    0.0007
##      7        1.0300             nan     0.0010    0.0007
##      8        1.0293             nan     0.0010    0.0007
##      9        1.0284             nan     0.0010    0.0005
##     10        1.0277             nan     0.0010    0.0006
##     20        1.0203             nan     0.0010    0.0003
##     40        1.0068             nan     0.0010    0.0008
##     60        0.9925             nan     0.0010    0.0007
##     80        0.9797             nan     0.0010    0.0006
##    100        0.9671             nan     0.0010    0.0005
gbmPred <- predict(gbm_model, newdata = test_predictors)
## Using 100 trees...
postResample(pred = gbmPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.8701713 0.5511551 0.7387145
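Part (a) also asks about resampling performance, which the fits above do not report. A minimal sketch of a cross-validated comparison with caret; the fold count and tuneLength below are illustrative assumptions:

set.seed(624)
ctrl <- trainControl(method = "cv", number = 10)

# Resampled random forest and boosted trees on the training predictors
rf_cv <- train(x = train_predictors, y = train_chem$Yield,
               method = "rf", trControl = ctrl, tuneLength = 5)
gbm_cv <- train(x = train_predictors, y = train_chem$Yield,
                method = "gbm", trControl = ctrl, tuneLength = 5,
                verbose = FALSE)

# Compare cross-validated RMSE and R-squared across the two models
summary(resamples(list(RF = rf_cv, GBM = gbm_cv)))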
  (b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

From the importance scores, ManufacturingProcess predictors dominate the list, with the remainder being BiologicalMaterial predictors. This matches the outcome from the non-linear model. The top 10 predictors are ManufacturingProcess32, BiologicalMaterial06, BiologicalMaterial03, ManufacturingProcess13, ManufacturingProcess36, BiologicalMaterial11, ManufacturingProcess09, BiologicalMaterial08, ManufacturingProcess28, and ManufacturingProcess11. Note that the table below shows only the first ten rows of the importance output; a sorted top-10 is sketched after it.

head(varImp(rf_model),10)
##                          Overall
## BiologicalMaterial01    4.859071
## BiologicalMaterial03   12.375324
## BiologicalMaterial05    6.642959
## BiologicalMaterial06   10.835926
## BiologicalMaterial08    4.937676
## BiologicalMaterial09    4.381674
## BiologicalMaterial10    4.177989
## BiologicalMaterial11   10.135233
## ManufacturingProcess01  4.251294
## ManufacturingProcess02  3.265441
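Because head(varImp(rf_model), 10) returns the first ten rows of the importance table in column order rather than the ten largest scores, a sorted top-10 can be obtained as below (a minimal sketch, assuming dplyr and tibble are available):

# Sort the random forest importance scores before taking the top 10
varImp(rf_model) |>
  tibble::rownames_to_column("Predictor") |>
  dplyr::arrange(dplyr::desc(Overall)) |>
  head(10)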
  (c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The tree starts with the most important predictor, ManufacturingProcess32, at the root. This view does provide additional knowledge, since it shows how the predictors split the data on the way to each terminal node. Seven of the ten terminal nodes are reached through splits on ManufacturingProcess variables and the remaining three through BiologicalMaterial variables, which suggests that the ManufacturingProcess variables are strong predictors of the yield response. A version of the plot that shows the yield distribution in the terminal nodes is sketched after the rpart plot below.

library(rpart)
library(rpart.plot)

#train single tree model
rpart_tree <- rpart(Yield ~., data = train_chem)

rpart.plot(rpart_tree)
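To show the distribution of yield in the terminal nodes explicitly, one option (an assumption, not part of the original output) is to convert the rpart fit with the partykit package, whose plot method draws a boxplot of the response in each terminal node:

library(partykit)

# as.party() keeps the fitted data, so each terminal node is drawn
# with a boxplot of Yield for the observations that fall into it
plot(as.party(rpart_tree))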