8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The random forest relied almost entirely on the informative predictors: V1, V4, and V2 have the largest importance scores, with V5 and V3 contributing more modestly. The uninformative predictors V6–V10 all have scores near zero or slightly negative, meaning that permuting them barely changes (or even slightly improves) the out-of-bag error, so the model did not make meaningful use of them.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

varImp(model2, scale = FALSE)
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

Yes, the importance score for V1 decreased (from about 8.7 to 5.7) when the highly correlated duplicate1 was added, and duplicate1 itself picked up a substantial score (about 4.3). This is expected: random forests split importance between correlated predictors, so when a near-duplicate is introduced the variables share the credit and each one's individual importance drops. Adding a second predictor that is also highly correlated with V1 should dilute V1's importance even further, as sketched below.
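
The exercise also asks what happens with a second highly correlated predictor. A minimal sketch (duplicate2 and model3 are names introduced here; the exact scores will vary with the random numbers drawn):

# Add a second near-copy of V1 and refit the random forest
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)

model3 <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)

# V1's importance should drop further, since the credit is now split
# three ways between V1, duplicate1, and duplicate2
varImp(model3, scale = FALSE)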

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
# Drop the correlated duplicate column(s) added above, keeping only V1-V10 and y
simulated <- simulated[, c(1:11)]

model4 <- cforest(
  y ~ .,
  data = simulated
)

varimp(model4, conditional = TRUE)
##           V1           V2           V3           V4           V5           V6 
##  5.471457361  5.166826657  0.020994281  6.689072245  1.256076719  0.004925215 
##           V7           V8           V9          V10 
## -0.008184439 -0.009141850 -0.012594617 -0.013200299

The cforest model shows broadly the same pattern as the traditional random forest: V4, V1, and V2 have the highest conditional importance, V5 is moderate, and V3 along with the uninformative V6–V10 is near zero or negative. Both approaches therefore identify the same set of truly informative variables. The main difference is that the conditional measure adjusts for correlations among predictors, as described in Strobl et al. (2007), which makes it a more reliable ranking when predictors are related; the unconditional measure from the same fit is shown below for comparison.
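
For reference, the traditional (unconditional) permutation importance can be pulled from the same cforest fit; conditional = FALSE is varimp()'s default, and the values shift a little from run to run because of the random permutations:

varimp(model4, conditional = FALSE)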

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

library(gbm)
global_tr_control <- trainControl(method = "cv", allowParallel = TRUE)

boost_tree_grid <- expand.grid(
  .interaction.depth = seq(1, 7, by = 1),
  .n.trees = seq(100, 1000, by = 50),
  .shrinkage = seq(.01, .5, by = .05),
  .n.minobsinnode = c(5:15)
)

boost_fit <- train(
  simulated[, c(1:10)],
  simulated$y,
  method = "gbm",
  tuneGrid = boost_tree_grid,
  verbose = FALSE,
  trControl = global_tr_control
)

varImp(boost_fit, scale = FALSE)
## gbm variable importance
## 
##     Overall
## V4   9420.3
## V1   8484.2
## V2   7416.5
## V5   3714.4
## V3   3088.3
## V7    548.1
## V6    534.3
## V8    434.6
## V9    382.7
## V10   381.0

Cubist

library(Cubist)
cubist_fit <- cubist(
  simulated[, c(1:10)],
  simulated$y
)

varImp(cubist_fit, scale = FALSE)
##     Overall
## V1       50
## V2       50
## V4       50
## V5       50
## V3        0
## V6        0
## V7        0
## V8        0
## V9        0
## V10       0

Both ensembles show the same broad pattern as the random forest: the informative predictors dominate (V4, V1, and V2 lead for boosting; V1, V2, V4, and V5 are the only predictors Cubist uses), while V6–V10 receive little or no importance. As before, V3 contributes relatively little.

8.2 Use a simulation to show tree bias with different granularities.

library(rpart)
library(partykit)
set.seed(19940211)

c1 <- sample(1:10 / 10, 2000, replace = TRUE)
c2 <- sample(1:100 / 100, 2000, replace = TRUE)
c3 <- sample(1:1000 / 1000, 2000, replace = TRUE)
c4 <- sample(1:10000 / 10000, 2000, replace = TRUE)

y <- c1 + c2 + c3 + c4 + rnorm(2000, mean = 0, sd = 1.5)

train_tree_data <- data.frame(c1, c2, c3, c4, y)

sim_tree_fit <- rpart(
  y ~ .,
  data = train_tree_data
)

plot(as.party(sim_tree_fit))

varImp(sim_tree_fit, scale = FALSE)
##       Overall
## c1 0.06361753
## c2 0.17198751
## c3 0.11809556
## c4 0.13957752

Although c1 through c4 contribute equally to y, the coarsest predictor c1 (only 10 distinct values) receives the lowest importance, while the finer-grained predictors get more credit. This illustrates the selection bias of CART-style trees toward predictors with more distinct values, i.e., more candidate split points.

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

When both the bagging fraction and the learning rate are 0.9, every tree is fit to nearly the whole training set and each tree's contribution is only lightly shrunken, so the boosting steps keep selecting the same few strong predictors, which accumulate almost all of the importance. When both are 0.1, each tree sees only a small random fraction of the data and takes a small step toward the fit, so different predictors get selected across the many trees and importance is spread over a larger set of variables. A sketch of the two settings on the solubility data follows.
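
Fig. 8.24 itself is not reproduced in this document, but the two settings can be sketched on the solubility data. This is a minimal sketch only: solTrainXtrans, solTrainY, solTestXtrans, and solTestY come from data(solubility) in AppliedPredictiveModeling, while the tree depth and number of trees are assumptions, not the values behind the figure.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans, solTrainY, solTestXtrans, solTestY

sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# Both parameters low: each tree sees 10% of the data and takes a small step
gbm_low <- gbm(
  Solubility ~ .,
  data = sol_train,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 1,  # assumed depth, not the setting used for Fig. 8.24
  shrinkage = 0.1,
  bag.fraction = 0.1
)

# Both parameters high: each tree sees 90% of the data and contributes heavily
gbm_high <- gbm(
  Solubility ~ .,
  data = sol_train,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 1,
  shrinkage = 0.9,
  bag.fraction = 0.9
)

head(summary(gbm_low, plotit = FALSE), 10)   # flatter relative-influence profile
head(summary(gbm_high, plotit = FALSE), 10)  # influence concentrated in a few predictors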

(b) Which model do you think would be more predictive of other samples?

The left model (with lower bagging fraction and shrinkage values) is likely to be more predictive on other samples. This is because it distributes importance across a broader set of predictors, which suggests it’s capturing more patterns in the data and is less likely to overfit. In contrast, the right model focuses too heavily on just a few variables, which increases the risk of overfitting to the training data and performing poorly on new, unseen data.

So, the left model may generalize better and be more reliable for predicting other samples.
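
One way to check this, continuing the sketch above (same assumed solubility objects), is to score both fits on the held-out test set:

pred_low  <- predict(gbm_low,  solTestXtrans, n.trees = 1000)
pred_high <- predict(gbm_high, solTestXtrans, n.trees = 1000)

caret::RMSE(pred_low,  solTestY)  # expected to be the lower of the two
caret::RMSE(pred_high, solTestY)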

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would most likely flatten the slope of the predictor importance plot for both models. Deeper trees split on more predictors beyond the first one or two, so importance accumulates over a larger set of variables; the top predictors would still lead, but the drop-off to the remaining predictors would become less steep. The change would be most visible in the right-hand model, which currently concentrates nearly all of its importance in a handful of variables. A rough check is sketched below.
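
Again building on the sketch above (same assumed objects), the aggressive model can be refit with a deeper interaction depth and the number of predictors carrying more than a token share of the relative influence compared; the exact counts will vary, but the deeper fit should distribute influence over more variables.

gbm_high_deep <- gbm(
  Solubility ~ .,
  data = sol_train,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 10,  # deeper trees, same shrinkage and bagging fraction
  shrinkage = 0.9,
  bag.fraction = 0.9
)

# Number of predictors with more than 1% relative influence at each depth
sum(summary(gbm_high, plotit = FALSE)$rel.inf > 1)
sum(summary(gbm_high_deep, plotit = FALSE)$rel.inf > 1)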

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

sum(is.na(ChemicalManufacturingProcess[, -c(1)]))
## [1] 106
# Removing entries with low variance
ChemicalManufacturingProcess <- ChemicalManufacturingProcess[, -nearZeroVar(ChemicalManufacturingProcess)]

# Filling NAs using KNN
knn_impute <- preProcess(
  ChemicalManufacturingProcess[, -c(1)],
  method = "knnImpute"
)

cmp_independent <- predict(
  knn_impute,
  ChemicalManufacturingProcess[, -c(1)]
)

cmp_dependent <- ChemicalManufacturingProcess[, c(1), drop = FALSE]

sum(is.na(cmp_independent[, -c(1)]))
## [1] 0
set.seed(19940211)
# Partition the data into a sample of 80% of the full dataset
cmp_train_rows <- createDataPartition(
  cmp_independent$BiologicalMaterial01,
  p = 0.8,
  list = FALSE
)

# Use the sample to create a training dataset
cmp_train_ind <- cmp_independent[cmp_train_rows, ]
cmp_train_dep <- cmp_dependent$Yield[cmp_train_rows]

# Use the sample to create a testing dataset
cmp_test_ind <- cmp_independent[-cmp_train_rows, ]
cmp_test_dep <- cmp_dependent$Yield[-cmp_train_rows]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Single Tree

cmp_stree_fit <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "rpart2",
  tuneLength = 10,
  trControl = global_tr_control
)

cmp_stree_pred <- predict(
  cmp_stree_fit,
  cmp_test_ind
)

postResample(
  cmp_stree_pred,
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.4827649 0.5008663 1.1742636

Bagged Trees

library(ipred)

cmp_bag_fit <- ipredbagg(
  cmp_train_dep,
  cmp_train_ind
)

cmp_bag_pred <- predict(
  cmp_bag_fit,
  cmp_test_ind
)

postResample(
  cmp_bag_pred,
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.2790524 0.6830667 0.9635714

Random Forest

library(randomForest)

cmp_randforest_fit <- randomForest(
  cmp_train_ind,
  cmp_train_dep,
  importance = TRUE,
  ntree = 1000
)

cmp_randforest_pred <- predict(
  cmp_randforest_fit,
  cmp_test_ind
)

postResample(
  cmp_randforest_pred,
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.3094566 0.6734821 0.9844426

Boosted Trees

boosted_animals <- train(
  cmp_train_ind,
  cmp_train_dep,
  method = "gbm",
  tuneGrid = boost_tree_grid,
  verbose = FALSE,
  trControl = global_tr_control
)

boosted_pred <- predict(
  boosted_animals,
  cmp_test_ind
)

postResample(
  boosted_pred,
  cmp_test_dep
)
##      RMSE  Rsquared       MAE 
## 1.3748034 0.5648721 1.0478236

Cubist

cmp_cubist_fit <- cubist(
  cmp_train_ind,
  cmp_train_dep
)

cmp_cubist_pred <- predict(
  cmp_cubist_fit,
  cmp_test_ind
)

postResample(
  cmp_cubist_pred,
  cmp_test_dep
)

The bagged tree model gave the best test-set performance, with the lowest RMSE (1.28), the lowest MAE (0.96), and the highest R² (0.68). Random forest came close on every metric, boosted trees did reasonably well, and the single tree trailed the ensembles throughout, so the bagged ensemble is the preferred tree-based model here.
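
The exercise also asks about resampling performance. For the two models fit through caret's train() (the single tree and the boosted trees), the cross-validated estimates at the chosen tuning parameters can be pulled directly from the fit objects; the bagged, random forest, and Cubist models above were fit outside of train(), so only their test-set metrics are available for them.

# Cross-validated performance at the selected tuning parameters (caret fits only)
getTrainPerf(cmp_stree_fit)
getTrainPerf(boosted_animals)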

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

library(dplyr)
varImp(cmp_bag_fit) |>
  arrange(desc(Overall)) |>
  head(10)
##                          Overall
## BiologicalMaterial03   0.9034354
## ManufacturingProcess32 0.8108736
## ManufacturingProcess17 0.7102317
## BiologicalMaterial11   0.7071977
## BiologicalMaterial12   0.7003613
## ManufacturingProcess31 0.6280937
## ManufacturingProcess09 0.6146938
## ManufacturingProcess06 0.5380744
## BiologicalMaterial05   0.5254365
## BiologicalMaterial09   0.4977530

The most important predictors in the bagged tree model are an even mix of the two types: the single highest-ranked variable is biological (BiologicalMaterial03), but the top 10 splits exactly five biological and five process variables, so neither group clearly dominates. Compared with the top-10 lists from the optimal linear and nonlinear models in Exercises 6.3 and 7.5, some overlap is likely, but the tree-based ranking can also surface predictors whose effect comes through interactions or thresholds that those models miss; a sketch for placing the lists side by side follows.
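
The comparison with the linear and nonlinear models can only be sketched here, since those fits belong to Exercises 6.3 and 7.5. Assuming those model objects are still available (cmp_pls_fit and cmp_svm_fit below are placeholder names; substitute whatever was kept from the earlier exercises), the three top-10 lists could be placed side by side like this:

# Helper: return the 10 predictors with the largest importance. Works for plain
# importance data frames (e.g., from ipredbagg) and for caret train objects,
# whose varImp() result stores the scores in the $importance element.
top10 <- function(fit) {
  imp <- varImp(fit)
  if (inherits(imp, "varImp.train")) imp <- imp$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

data.frame(
  bagged    = top10(cmp_bag_fit),
  linear    = top10(cmp_pls_fit),  # placeholder: optimal linear model from 6.3
  nonlinear = top10(cmp_svm_fit)   # placeholder: optimal nonlinear model from 7.5
)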

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart.plot)

rpart.plot(
  cmp_stree_fit$finalModel,
  type = 4
)

Yes, this tree plot provides helpful insight into how specific biological and process variables relate to yield. For example, the first split on ManufacturingProcess32 shows it plays a major role in predicting outcomes. As you move down the tree, variables like ManufacturingProcess11, BiologicalMaterial12, and BiologicalMaterial01 continue to appear in key splits, reinforcing their importance. The values in the terminal nodes also give a clear sense of how different combinations of these variables influence the predicted yield. Overall, the tree structure offers a more interpretable, step-by-step view of how certain predictor thresholds impact yield.
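
The rpart.plot above labels each terminal node with its mean predicted yield rather than the full distribution the exercise asks about. One way to show the distribution, as a sketch, is to refit the selected tree directly with rpart (using the tuned depth stored in cmp_stree_fit$bestTune and keeping the model frame) and plot it via partykit, which draws a boxplot of the observed yields in every terminal node. The refit tree may differ slightly from caret's final model.

cmp_stree_refit <- rpart(
  y ~ .,
  data = data.frame(cmp_train_ind, y = cmp_train_dep),
  control = rpart.control(maxdepth = cmp_stree_fit$bestTune$maxdepth),
  model = TRUE  # keep the model frame so as.party() can attach the data
)

# Terminal nodes now show boxplots of the observed yield
plot(as.party(cmp_stree_refit))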

# Shut down the parallel backend that was registered earlier for caret
stopCluster(cluster)
registerDoSEQ()