To solve problems 8.1, 8.2, 8.3, and 8.7, we can use the R examples and computing techniques referenced in Chapter 8. Below are R solutions adapted for these problems:

Exercise 8.1: Random Forest and Variable Importance

Steps:

  1. Simulate the data using the mlbench package.
  2. Train a Random Forest model and compute variable importance.

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

Based on the variable importance output from the Random Forest model shown in the table below:

  • V6–V10 have importance values near zero (some slightly negative), far below the informative predictors. V1, V2, and V4 dominate the ranking, V5 contributes moderately, and V3's score is comparatively small. Near-zero or negative values indicate that a predictor did little to reduce prediction error.

  • In short, the random forest did not make meaningful use of the uninformative predictors (V6–V10).

# Load required libraries
library(mlbench)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
# Simulate data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

# Fit Random Forest Model
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# Variable Importance
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

(b) What happens when you add another predictor that is also highly correlated with V1?

Based on the results below, adding duplicate1 (and then duplicate2), both highly correlated with V1, caused V1's importance score to drop (from about 8.7 with no duplicates to about 6.1 and then 5.7) as the random forest split importance among the correlated predictors. This is the dilution effect; the forest's overall predictive accuracy, however, should be largely unaffected (see the OOB comparison sketch after the importance output below).

# Load required libraries
library(mlbench)
library(randomForest)
library(caret)

# Simulate the original data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

# Add a highly correlated predictor (duplicate1)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1

# Check correlation
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9409437
# Fit a Random Forest model with the new predictor
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# Variable importance
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                 Overall
## V1          6.056389320
## V2          6.212229612
## V3          0.618727268
## V4          6.946472869
## V5          2.123011165
## V6          0.194903846
## V7          0.047119933
## V8         -0.117467449
## V9         -0.003865473
## V10         0.076718487
## duplicate1  3.647579333
# Add another highly correlated predictor (duplicate2)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1

# Fit a Random Forest model with both correlated predictors
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)

# Variable importance
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
##                Overall
## V1          5.73255552
## V2          6.77680990
## V3          0.56739039
## V4          7.25925984
## V5          1.96019469
## V6          0.04397446
## V7         -0.02439883
## V8         -0.15991585
## V9         -0.07768904
## V10        -0.02865474
## duplicate1  2.76510447
## duplicate2  1.59470241
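
To back up the claim that predictive performance is largely unaffected by the correlated duplicates, a minimal check is to compare the out-of-bag (OOB) error of the three forests fit above. This is a sketch (not output from the original run) and assumes model1, model2, and model3 are still in the workspace.

# Compare the final out-of-bag MSE (the last element of $mse) across the three
# forests: no duplicates, one duplicate, and two duplicates of V1.
oob_mse <- c(
  no_duplicates  = tail(model1$mse, 1),
  one_duplicate  = tail(model2$mse, 1),
  two_duplicates = tail(model3$mse, 1)
)
print(round(oob_mse, 3))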

(c)

Question: Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

Answer: The broad pattern is the same under both measures: the informative predictors (V1, V2, V4, V5) stand out and the uninformative V6–V10 sit near zero. The traditional importance (conditional = FALSE) still splits credit between V1 and its duplicates, while the conditional importance (conditional = TRUE) shrinks the scores of the whole correlated group (V1, duplicate1, duplicate2) relative to the uncorrelated V2 and V4, because it measures each predictor's contribution after accounting for correlated predictors. The conditional measure is therefore the better choice when predictors are correlated.

# Load required libraries
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
# Set up a parallel backend
# (note: party's cforest() and varimp() run their own compiled code and do not
# use this foreach backend; it is registered here only for consistency with the
# rest of the document)
numCores <- parallel::detectCores() - 1  # Use all but one core
cl <- makeCluster(numCores)
registerDoParallel(cl)

# Fit a conditional inference forest model
cforest_model <- cforest(
  y ~ ., 
  data = simulated, 
  controls = cforest_unbiased(ntree = 1000)
)

# Calculate variable importance (traditional)
cforest_imp_traditional <- varimp(cforest_model, conditional = FALSE)

# Calculate variable importance (conditional)
cforest_imp_conditional <- varimp(cforest_model, conditional = TRUE)

# Combine the results into a data frame
importance_df <- data.frame(
  Variable = names(cforest_imp_traditional),
  Traditional_Importance = cforest_imp_traditional,
  Conditional_Importance = cforest_imp_conditional
)

# Print the data frame
print(importance_df)
##              Variable Traditional_Importance Conditional_Importance
## V1                 V1            6.009566768            2.603605015
## V2                 V2            6.207797592            4.991374161
## V3                 V3            0.014141581            0.007755235
## V4                 V4            7.078904806            5.676538204
## V5                 V5            1.777784489            1.137300318
## V6                 V6            0.002599111           -0.010147657
## V7                 V7            0.021512315           -0.003595551
## V8                 V8           -0.038218250           -0.015512033
## V9                 V9           -0.008663233           -0.008066696
## V10               V10           -0.049479513           -0.017441744
## duplicate1 duplicate1            1.949298911            0.571089877
## duplicate2 duplicate2            1.438289739            0.371323995
# Stop the parallel backend
stopCluster(cl)
registerDoSEQ()  # Return to sequential execution

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Procedure

  1. Boosted Trees (GBM):
    • Train models with moderate tuning parameters (shrinkage = 0.1, bag.fraction = 0.5) and high tuning parameters (shrinkage = 0.9, bag.fraction = 0.9).
    • Calculate variable importance using gbm's summary function (relative influence).
  2. Cubist Model:
    • Train models with different numbers of committees (1 and 10) and use the model's variable-usage statistics as an indicator of importance (see the sketch after the boosted-tree importance output below).
  3. Comparison:
    • Compile the importance values into data frames for the boosted-tree and Cubist models to highlight the impact of the tuning parameters.

Answer: Yes, a broadly similar pattern occurs. The boosted trees identify the informative predictors (V4, V1, V2, followed by V3 and V5) under both tuning settings, and the relative rankings remain stable. The absolute importance values shift slightly, with the high shrinkage/high bagging-fraction model assigning somewhat more relative influence to the uninformative predictors (V6–V10).

# Load required libraries
library(caret)
library(Cubist)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Simulate the original data
# Adjust dataset size
set.seed(200)
simulated <- mlbench.friedman1(500, sd = 1)  # Increase dataset size to 500
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

# Split data into training and testing sets
set.seed(300)
trainIndex <- createDataPartition(simulated$y, p = 0.8, list = FALSE)
trainData <- simulated[trainIndex, ]
testData <- simulated[-trainIndex, ]

# Boosted Trees Model using gbm with adjusted parameters
set.seed(400)
gbm_model <- gbm(y ~ ., 
                 data = trainData, 
                 distribution = "gaussian", 
                 n.trees = 1000, 
                 interaction.depth = 3, 
                 shrinkage = 0.1, 
                 bag.fraction = 0.5,  # Increased bag.fraction
                 n.minobsinnode = 5)  # Reduced n.minobsinnode

gbm_imp_low <- summary(gbm_model)

# Adjust shrinkage and bag.fraction to 0.9
set.seed(500)
gbm_model_high <- gbm(y ~ ., 
                      data = trainData, 
                      distribution = "gaussian", 
                      n.trees = 1000, 
                      interaction.depth = 3, 
                      shrinkage = 0.9, 
                      bag.fraction = 0.9, 
                      n.minobsinnode = 5)

gbm_imp_high <- summary(gbm_model_high)

# Compile results into a data frame
boosted_importance <- data.frame(
  Variable = gbm_imp_low$var,
  Importance_Low = gbm_imp_low$rel.inf,
  Importance_High = gbm_imp_high$rel.inf
)

# Print results
print("Boosted Trees Importance:")
## [1] "Boosted Trees Importance:"
print(boosted_importance)
##    Variable Importance_Low Importance_High
## 1        V4      35.822166       35.802101
## 2        V1      21.858930       22.919984
## 3        V2      17.209438       15.881143
## 4        V3       9.883539        8.515685
## 5        V5       9.318789        7.512770
## 6        V9       1.433402        2.505251
## 7        V8       1.237092        2.044694
## 8        V7       1.118350        2.012220
## 9        V6       1.075841        1.406649
## 10      V10       1.042451        1.399503
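
The Cubist part of the procedure can be handled in the same way; the sketch below (not part of the original output) fits models with 1 and 10 committees on the same training data and uses caret's varImp() method for Cubist, which is based on the model's variable-usage statistics.

# Cubist models with different numbers of committees
library(Cubist)
set.seed(600)
cubist_1  <- cubist(x = trainData[, names(trainData) != "y"], y = trainData$y, committees = 1)
cubist_10 <- cubist(x = trainData[, names(trainData) != "y"], y = trainData$y, committees = 10)

# Variable usage-based importance from caret, aligned by predictor name
imp_1  <- caret::varImp(cubist_1)
imp_10 <- caret::varImp(cubist_10)
cubist_importance <- data.frame(
  Variable      = rownames(imp_1),
  Committees_1  = imp_1$Overall,
  Committees_10 = imp_10[rownames(imp_1), "Overall"]
)
print(cubist_importance)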

Exercise 8.2: Tree Bias with Granular Predictors

Objective: Simulate and illustrate tree bias when predictors have different levels of granularity.

# Load required library
library(rpart)
library(rpart.plot)

# Simulate data
set.seed(123)
n <- 500
# Predictor with high granularity (continuous)
x1 <- runif(n, 0, 100)
# Predictor with low granularity (categorical)
x2 <- sample(letters[1:3], n, replace = TRUE)
# Another categorical predictor with higher levels
x3 <- sample(letters[1:10], n, replace = TRUE)
# Response variable based on x1 and some noise
y <- 3 * x1 + rnorm(n)

# Combine into a data frame
data <- data.frame(x1, x2, x3, y)

# Fit a regression tree
tree_model <- rpart(y ~ ., data = data)

# Plot the tree
rpart.plot(tree_model)

# Check variable importance
importance <- tree_model$variable.importance
importance_df <- data.frame(Variable = names(importance), Importance = importance)
print("Variable Importance:")
## [1] "Variable Importance:"
print(importance_df)
##    Variable Importance
## x1       x1  3571080.6
## x3       x3   293364.4
## x2       x2   153269.8
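
In the importance table above, x1 dominates because it actually drives y, but note that x3 (ten unrelated levels) still scores higher than x2 (three unrelated levels): this is the selection bias toward more granular predictors that the exercise asks about. A more direct, hedged way to isolate the bias is sketched below: the response is pure noise, the tree is forced to make exactly one split (cp = 0, maxdepth = 1), and we count which predictor wins the first split over repeated simulations.

# Sketch: with a noise-only response, an unbiased procedure would have no reason
# to prefer either predictor, yet the high-granularity predictor tends to win.
set.seed(321)
first_split <- replicate(200, {
  n        <- 200
  x_fine   <- runif(n)                               # many distinct values
  x_coarse <- factor(sample(letters[1:3], n, TRUE))  # only three levels
  y_noise  <- rnorm(n)                               # unrelated to both predictors
  fit <- rpart(y_noise ~ x_fine + x_coarse,
               data = data.frame(x_fine, x_coarse, y_noise),
               control = rpart.control(cp = 0, maxdepth = 1, minsplit = 2))
  as.character(fit$frame$var[1])
})
table(first_split)  # expected to be dominated by x_fine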

Exercise 8.3: Stochastic Gradient Boosting and Variable Importance

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

  • Right-hand model (bagging fraction = 0.9, shrinkage = 0.9):
    • Higher bagging fraction and shrinkage parameters mean that a larger portion of the dataset is used for each iteration, and each tree has a larger impact on the overall model. This results in the model focusing on the strongest predictors, emphasizing their importance while suppressing weaker predictors.
  • Left-hand model (bagging fraction = 0.1, shrinkage = 0.1):
    • Lower bagging fraction and shrinkage lead to smaller, incremental updates to the model. This allows the boosting process to distribute importance across a wider range of predictors, capturing more subtle relationships in the data.

(b) Which model do you think would be more predictive of other samples?

  • The left-hand model is likely to be more predictive of other samples (better generalization), as it considers a broader set of predictors. The right-hand model, while potentially more accurate on the training data, risks overfitting by relying heavily on a few predictors and neglecting weaker, yet potentially informative, ones.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

  • Increasing the interaction depth produces deeper trees that split on more predictors, so importance credit is spread over a larger set of variables. For both models this would flatten the slope of the predictor importance profile (less differentiation between the top predictors and the rest), with the change most visible in the right-hand model, whose importance is currently concentrated in only a few predictors. A sketch reproducing the two tuning settings on the solubility data follows below.
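
Fig. 8.24 is based on the solubility data, so the two tuning settings can be reproduced directly. The sketch below assumes the solubility objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package and uses gbm's relative influence; it is meant as an illustration of the flat-versus-steep importance profiles, not a tuned model.

# Sketch: refit boosted models with the two tuning settings from Fig. 8.24 and
# compare their relative-influence profiles.
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_left  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# The right-hand settings should concentrate relative influence on far fewer
# predictors than the left-hand settings.
head(summary(gbm_left,  plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)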

Exercise 8.7: Tree-Based Models for the Chemical Manufacturing Process Data

Prepare the data

(1) Load the Chemical Manufacturing Process Data

The code loads the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling package and splits it into a set of predictors (predictors) and a response variable (response), the percent yield.

  • The dataset contains 176 observations on 58 columns: the yield plus 57 predictors (12 biological material measurements and 45 manufacturing process variables).
  • Yield is the first column of the data frame, so the predictors are obtained by dropping that column, and the response is taken from ChemicalManufacturingProcess$Yield.
# Load necessary library
if (!require("AppliedPredictiveModeling")) {
  install.packages("AppliedPredictiveModeling")
  library(AppliedPredictiveModeling)
}
## Loading required package: AppliedPredictiveModeling
# Load the data
data("ChemicalManufacturingProcess")
predictors <- ChemicalManufacturingProcess[, -1]  # Exclude the first column, which is Yield
response <- ChemicalManufacturingProcess$Yield  # Assign response variable explicitly
# Review the data
#str(predictors)  # Predictor columns
#str(response)    # Response variable: percent yield

(2) Handle Missing Values with Imputation

  • Use caret’s preProcess to fill missing values.

The code imputes missing values in predictors using the median of each column (caret's medianImpute method), creating a fully imputed dataset, imputedPredictors.

# Load necessary libraries
library(caret)
library(doParallel)

# Set up parallel processing
cores <- parallel::detectCores() - 1  # Use one less core than available
cl <- makeCluster(cores)  # Create a parallel cluster
registerDoParallel(cl)  # Register the parallel backend

# Create a preprocessing object to impute missing values with the median
preProcessObj <- preProcess(
  predictors,
  method = "medianImpute"
)

# Apply the preprocessing object to impute missing values in the dataset
# (note: preProcess()/predict() do not use the registered foreach backend, so
# the parallel cluster above is optional for this step)
imputedPredictors <- predict(preProcessObj, predictors)

# Stop the parallel backend
stopCluster(cl)
registerDoSEQ()  # Return to sequential processing
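
A quick sanity check (a sketch, not part of the original output) confirms that the imputation left no missing values:

# Count remaining missing values after median imputation; this should be zero
sum(is.na(imputedPredictors))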

Code Implementation:

# Load necessary libraries
library(AppliedPredictiveModeling)
library(caret)
library(rpart)
library(randomForest)
library(gbm)
library(rpart.plot)


# Identify predictors with near-zero variance
nzv <- nearZeroVar(imputedPredictors)

# Filter out those predictors
filteredPredictors <- imputedPredictors[, -nzv]

# The response (Yield) was already extracted above from ChemicalManufacturingProcess$Yield,
# so filteredPredictors now contains only predictor columns

# Check the number of predictors remaining
num_predictors <- ncol(filteredPredictors)
print(paste("Number of predictors left after removing near-zero variance:", num_predictors))
## [1] "Number of predictors left after removing near-zero variance: 56"
# Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(response, p = 0.8, list = FALSE)
trainPredictors <- filteredPredictors[trainIndex, ]
testPredictors <- filteredPredictors[-trainIndex, ]
trainResponse <- response[trainIndex]
testResponse <- response[-trainIndex]

# Train tree-based models
# 1. Single regression tree
set.seed(456)
treeModel <- rpart(trainResponse ~ ., data = data.frame(trainPredictors, trainResponse))

# 2. Random forest
set.seed(456)
rfModel <- randomForest(
  trainResponse ~ ., 
  data = data.frame(trainPredictors, trainResponse),
  importance = TRUE  # Enable variable importance calculation
)

# 3. Gradient boosting
set.seed(456)
gbmModel <- gbm(
  trainResponse ~ ., 
  data = data.frame(trainPredictors, trainResponse), 
  distribution = "gaussian", 
  n.trees = 1000, 
  interaction.depth = 3, 
  shrinkage = 0.1
)

# Evaluate models
treePred <- predict(treeModel, testPredictors)
rfPred <- predict(rfModel, testPredictors)
gbmPred <- predict(gbmModel, testPredictors, n.trees = 1000)

treeRMSE <- RMSE(treePred, testResponse)
rfRMSE <- RMSE(rfPred, testResponse)
gbmRMSE <- RMSE(gbmPred, testResponse)

performance <- data.frame(
  Model = c("Single Tree", "Random Forest", "Gradient Boosting"),
  RMSE = c(treeRMSE, rfRMSE, gbmRMSE)
)


# Variable importance from the random forest model (the boosted model's relative influence is examined below)
rfImportance <- data.frame(
  Variable = rownames(importance(rfModel)),
  Importance = importance(rfModel)[, "IncNodePurity"]
)
rfImportance <- rfImportance[order(-rfImportance$Importance), ]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Based on the RMSE values:

Gradient Boosting has the lowest RMSE (0.2566980), indicating it provides the best performance on the test set. The Single Tree model performs moderately well (RMSE = 0.2915351). The Random Forest has the highest RMSE (0.4941925), suggesting it is less effective for this dataset compared to the other models.

print("Model Performance:")
## [1] "Model Performance:"
print(performance)
##               Model      RMSE
## 1       Single Tree 0.2915351
## 2     Random Forest 0.4941925
## 3 Gradient Boosting 0.2566980
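
The exercise also asks about resampling performance, which the single train/test split above does not estimate. A hedged sketch using caret's train() with 10-fold cross-validation is shown below; the method codes are standard caret model names, and the tuning grids are left to caret's defaults, so the resulting RMSE values are indicative rather than definitive.

# Sketch: cross-validated (resampling) performance for the three model types
set.seed(789)
ctrl <- trainControl(method = "cv", number = 10)

cvTree <- train(x = trainPredictors, y = trainResponse,
                method = "rpart", trControl = ctrl, tuneLength = 10)
cvRF   <- train(x = trainPredictors, y = trainResponse,
                method = "rf", trControl = ctrl, tuneLength = 5)
cvGBM  <- train(x = trainPredictors, y = trainResponse,
                method = "gbm", trControl = ctrl, tuneLength = 5, verbose = FALSE)

# Compare cross-validated RMSE across the three models
resamps <- resamples(list(SingleTree = cvTree, RandomForest = cvRF, GBM = cvGBM))
summary(resamps)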

(b) Important Predictors

Question: Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Analysis of the Results:

  1. Most Important Predictors:
    • The top predictors in the optimal tree-based regression model are primarily process variables, such as ManufacturingProcess32, ManufacturingProcess31, and ManufacturingProcess36. These dominate the top ranks, with ManufacturingProcess32 showing the highest importance.
    • Some biological variables, such as BiologicalMaterial06 and BiologicalMaterial03, are also present but are less influential compared to the process predictors.
  2. Dominance:
    • Process variables dominate the list, suggesting that manufacturing adjustments have a stronger influence on yield compared to biological materials.
  3. Comparison with Linear and Nonlinear Models:
    • In linear models, the importance ranking might give more weight to biological predictors because of their direct (additive) contributions to yield.
    • Nonlinear models, like tree-based models, can emphasize interactions and other complex patterns, so their rankings may align more closely with this list (a sketch fitting a linear and a nonlinear model for a direct comparison follows after the importance table below).

Conclusion:

Process predictors dominate the tree-based regression model's variable importance rankings, while biological variables contribute less but remain relevant. Compared to linear models, the rankings might differ because the tree-based models can capture non-linear relationships and interactions; nonlinear models (e.g., SVM or MARS) might exhibit patterns closer to those seen here.

print("Top 10 Important Predictors in Random Forest:")
## [1] "Top 10 Important Predictors in Random Forest:"
print(rfImportance[1:10, ])
##                                      Variable Importance
## Yield                                   Yield 224.257880
## ManufacturingProcess32 ManufacturingProcess32  56.301556
## BiologicalMaterial06     BiologicalMaterial06  20.183422
## ManufacturingProcess31 ManufacturingProcess31  15.282151
## BiologicalMaterial03     BiologicalMaterial03  14.810535
## ManufacturingProcess36 ManufacturingProcess36  13.890872
## ManufacturingProcess13 ManufacturingProcess13  11.304889
## BiologicalMaterial12     BiologicalMaterial12   9.405687
## BiologicalMaterial02     BiologicalMaterial02   8.851663
## ManufacturingProcess17 ManufacturingProcess17   8.542617
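
Since gradient boosting gave the lowest test-set RMSE, its own importance measure is worth inspecting as well. The sketch below (not part of the original output) extracts gbm's relative influence and, for the comparison requested in the question, fits a linear (PLS) and a nonlinear (radial SVM) model with caret so their top-10 predictors can be placed next to the tree-based ranking; the pls and kernlab packages are assumed to be installed.

# Sketch: relative influence from the boosted model
gbmImportance <- summary(gbmModel, n.trees = 1000, plotit = FALSE)
head(gbmImportance, 10)

# Sketch: linear (PLS) and nonlinear (radial SVM) models for comparison
set.seed(101)
ctrl <- trainControl(method = "cv", number = 10)
plsFit <- train(x = trainPredictors, y = trainResponse, method = "pls",
                preProcess = c("center", "scale"), trControl = ctrl, tuneLength = 10)
svmFit <- train(x = trainPredictors, y = trainResponse, method = "svmRadial",
                preProcess = c("center", "scale"), trControl = ctrl, tuneLength = 10)

# Top-10 predictors for each model
plsImp <- varImp(plsFit)$importance
svmImp <- varImp(svmFit)$importance
head(plsImp[order(-plsImp$Overall), , drop = FALSE], 10)
head(svmImp[order(-svmImp$Overall), , drop = FALSE], 10)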

(c) Plot of Single Tree

Question: Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Yes, the tree visualization provides valuable additional insights:

  1. Threshold-based Splits:
    • The tree splits on predictor thresholds, showing how a small number of variables divide the samples into groups with distinctly different yields.
    • These splits highlight critical decision points for predicting yield.
  2. Distribution of Terminal Nodes:
    • Each terminal node shows the mean predicted yield and the proportion of training samples that fall into it, which identifies the yield levels that occur and how prevalent they are in the dataset.
  3. Process vs. Biological Predictors:
    • Examining which variables appear in the splits shows whether process or biological predictors dominate the decision points (a short sketch listing the split variables follows after the plot).
  4. Actionable Insights:
    • If process predictors dominate the splits, adjusting those factors is a potential lever for improving yield.
    • The visualization reveals whether specific predictor thresholds substantially change the predicted yield, offering actionable targets for optimization.

In summary, the tree adds interpretability to the relationship between predictors and yield, illustrating how variables segment the data and highlighting key areas for intervention.

# Plot the single tree with yield distribution in terminal nodes
rpart.plot(treeModel, box.palette = "RdBu", shadow.col = "gray", nn = TRUE)
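
As a quick follow-up to point 3 above, the predictors actually used in the tree's splits can be listed directly (a sketch, not part of the original output):

# List the variables used as split points in the single regression tree
splitVars <- setdiff(unique(as.character(treeModel$frame$var)), "<leaf>")
print(splitVars)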