library(tidyverse)
library(mlbench)
library(caret)
library(MASS)
library(earth)
library(AppliedPredictiveModeling)
library(randomForest)
library(party)
library(rpart)
library(gbm)
library(ranger)
library(ggplot2)
library(rpart.plot)

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

Exercise 8.1

Recreate the simulated data from Exercise 7.2

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a. Fit a random forest model to all predictors and estimate variable importance scores using the randomForest package.

model1 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.62743275
## V2   6.27437240
## V3   0.72305459
## V4   7.50258584
## V5   2.13575650
## V6   0.12395003
## V7   0.02927888
## V8  -0.11724317
## V9  -0.10344797
## V10  0.04312556

Did the random forest model significantly use the uninformative predictors (V6-V10)?

The random forest did not make significant use of the uninformative predictors V6-V10: their importance scores are all close to 0, meaning their influence on the model is minimal.
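
As a quick visual check, randomForest's built-in importance plot shows the same separation between the informative and uninformative predictors (a minimal sketch using the model1 object fit above):

varImpPlot(model1, type = 1, scale = FALSE,
           main = "Permutation importance (model1)")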

b. Add a new predictor that is highly correlated with an existing predictor, and compute the correlation between them.

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201

There is a high correlation of about 0.95 between the existing predictor V1 and the new predictor duplicate1.

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                 Overall
## V1          6.774034589
## V2          6.426340527
## V3          0.613805379
## V4          7.135941576
## V5          2.135242904
## V6          0.171933358
## V7          0.142238552
## V8         -0.073192083
## V9         -0.098719872
## V10        -0.009701234
## duplicate1  3.084990840

The importance score for V1 dropped from 8.63 to 6.77, with part of its importance absorbed by the new duplicate1 predictor (3.08). V4's score barely changed (7.50 to 7.14), but V4 is now ranked above V1.
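
To make the shift explicit, the two sets of scores can be placed side by side (a sketch; the .before/.after suffixes are labels chosen here, not part of any package):

merge(rfImp1, rfImp2, by = "row.names", all = TRUE,
      suffixes = c(".before", ".after"))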

What happens when you add another predictor that is also highly correlated with V1?

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9337221

The correlation between the existing predictor V1 and the second new predictor (duplicate2) is about 0.93.

model3 <- randomForest(y ~ ., data = simulated, 
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                 Overall
## V1          5.908641677
## V2          6.586726939
## V3          0.559845667
## V4          7.373782389
## V5          1.987341138
## V6          0.162417814
## V7          0.038423138
## V8          0.007497423
## V9         -0.001806331
## V10         0.004023755
## duplicate1  2.351543736
## duplicate2  2.305339113

With the addition of a second highly correlated predictor, the importance of V1 decreased further to 5.91, and the importance is now split among V1, duplicate1, and duplicate2.

c. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##           V1           V2           V3           V4           V5           V6 
##  5.335045520  5.312801960  0.034111050  6.679015663  1.146860652 -0.015403265 
##           V7           V8           V9          V10 
##  0.009873139 -0.008375437 -0.012514552 -0.031028246

When conditional = TRUE, V1's importance is lower and the ordering differs somewhat from the traditional random forest: V1 and V2 are nearly tied, V3's importance essentially disappears, and V4 remains the top predictor, while the uninformative V6-V10 stay near zero. So the conditional importances do not follow exactly the same pattern as the traditional model.
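
For a direct comparison on the same fit, the traditional (unconditional) importance can also be computed from the cforest object; this is a sketch reusing model4 from above (note that the duplicate predictors were excluded from this particular fit):

rfImp4_uncond <- varimp(model4, conditional = FALSE)
sort(rfImp4_uncond, decreasing = TRUE)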

d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

library(gbm)
gbmModel <- gbm(y ~ ., 
                data = simulated, 
                distribution = "gaussian")

summary.gbm(gbmModel)

##                   var    rel.inf
## V4                 V4 30.0120572
## V1                 V1 26.2381387
## V2                 V2 22.7009480
## V5                 V5 10.6886824
## V3                 V3  8.2961175
## duplicate1 duplicate1  1.5932898
## duplicate2 duplicate2  0.2523780
## V6                 V6  0.2183885
## V7                 V7  0.0000000
## V8                 V8  0.0000000
## V9                 V9  0.0000000
## V10               V10  0.0000000

For boosted trees, a slightly different pattern emerges: V4 has the highest relative influence, with V1 second and V2 third. The duplicates and the uninformative predictors V6-V10 have little to no relative influence.

Cubist

cubistModel <- train(y ~ .,
                     data = simulated[,c(1:11)],
                     method = "cubist")

varImp(cubistModel$finalModel, scale = FALSE) 
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

In the Cubist model, V1 remains the most important variable, followed by V2, V4, V3, and V5. V7 to V10 are completely irrelevant, while V6 retains a small amount of importance, though still far below that of the informative variables V1-V5.
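
Note that the Cubist fit above, like the cforest fit, excluded the two duplicate predictors. A sketch for checking whether the correlated-predictor pattern also appears with Cubist (refitting on the full simulated data, including duplicate1 and duplicate2; if Cubist behaves like the other models, some of V1's importance should shift to the duplicates):

set.seed(200)
cubistModel2 <- train(y ~ .,
                      data = simulated,
                      method = "cubist")
varImp(cubistModel2$finalModel, scale = FALSE)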

Exercise 8.2

Use a simulation to show tree bias with different granularities.

set.seed(123)
# Create predictors with different granularities using sample()
x1 <- sample(1:120, 100, replace = TRUE)  # high granularity (1-120)
x2 <- sample(1:40, 100, replace = TRUE)   # medium granularity (1-40)
x3 <- sample(1:8, 100, replace = TRUE)    # low granularity (1-8)

# Response variable, independent of predictors but with random noise
y <- rnorm(100, mean = 50, sd = 10)

df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

library(rpart)
tree_model <- rpart(y ~ ., data = df)

# Extract variable importance
var_imp <- tree_model$variable.importance

# Normalize importances so the maximum is 10 for easier comparison
var_imp_norm <- 10 * var_imp / max(var_imp)

print("Variable Importance (Normalized):")
## [1] "Variable Importance (Normalized):"
var_imp_norm
##        x2        x1        x3 
## 10.000000  4.726867  1.932806

Observations:

Even though the response variable y was generated independently of all three predictors, the decision tree still assigns them non-zero importance. The two finer-grained predictors, x1 (up to 120 unique values) and x2 (up to 40), rank well above the coarse-grained x3 (only 8 possible values); in this particular run x2 happens to edge out x1, but x3, which offers the fewest candidate split points, comes last.

This output illustrates a well-known bias of decision trees: variables with more distinct values offer more candidate split points, and therefore more chances to reduce impurity by chance alone, so they appear more "important" even when they are not genuinely predictive.

Variable importance from decision trees should be interpreted with caution, and additional validation or complementary methods may be necessary to confirm true predictor relevance.
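
Because a single simulated data set is noisy (here x2 happened to edge out x1), a cleaner demonstration repeats the simulation many times and records which predictor is chosen for the first split. This is a minimal sketch under the same setup as above, with 200 repetitions (the seed and repetition count are arbitrary choices):

set.seed(456)
first_split <- replicate(200, {
  sim <- data.frame(
    y  = rnorm(100, mean = 50, sd = 10),       # pure noise response
    x1 = sample(1:120, 100, replace = TRUE),   # fine granularity
    x2 = sample(1:40, 100, replace = TRUE),    # medium granularity
    x3 = sample(1:8, 100, replace = TRUE)      # coarse granularity
  )
  fit <- rpart(y ~ ., data = sim)
  as.character(fit$frame$var[1])  # root split variable ("<leaf>" if no split was made)
})
table(first_split)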

Exercise 8.3

Stochastic Gradient Boosting

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Predictor importance

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

Answer: In stochastic gradient boosting, the trees built at each iteration depend on the residuals left by the preceding iterations, which distinguishes the method from random forests. Two parameters regulate this dependency: the learning rate (the fraction of each tree's predicted value that is added to the running model) and the bagging fraction (the fraction of the training data randomly sampled to build each tree). When both are set to the high value of 0.9, each iteration uses most of the data and takes large steps along the gradient, so the model quickly commits to the strongest predictors and only a handful of them end up with substantial importance. When both are lowered to 0.1, the smaller steps and smaller samples force the model to build up the fit gradually across many trees, spreading the importance over more predictors. This is why the bagging fraction and learning rate need to be tuned when building a stochastic gradient boosting model.
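
A sketch of how this behaviour can be reproduced on the solubility data used in Fig. 8.24, relying on the gbm and AppliedPredictiveModeling packages loaded above (assumptions: 100 trees and the default interaction depth; the book's exact tuning is not replicated):

data(solubility)  # loads solTrainXtrans, solTrainY, among others
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low,  plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)   # importance concentrated on a few predictors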

(b) Model predictive power

Which model do you think would be more predictive of other samples?

Answer: The model on the left (learning rate and bagging fraction of 0.1) would likely be more predictive of new samples: by not over-committing to a few predictors it generalizes better and is less prone to overfitting. Although a lower learning rate requires more iterations, and therefore more computation, that trade-off is usually worthwhile.

(c) Interaction depth

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer: Increasing the interaction depth allows deeper trees that draw on more of the less significant predictors, so importance is spread across more variables and the slope of the predictor importance plot becomes more gradual for either model.
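
Building on the sketch above, one way to see this effect is to refit with a larger interaction depth and compare the relative-influence profiles (the depth of 10 is an arbitrary illustrative choice):

gbm_high_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                     n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                     interaction.depth = 10)
head(summary(gbm_high_deep, plotit = FALSE), 10)  # influence now spreads to more predictors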

Exercise 8.7

Chemical manufacturing process

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
dim(ChemicalManufacturingProcess)
## [1] 176  58

Impute missing values

# Impute missing values using k-Nearest Neighbors imputation

set.seed(123) # for reproducibility

chem_impute <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute"))

df <- predict(chem_impute, ChemicalManufacturingProcess) # Apply imputation
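
A quick sanity check on the imputation (a minimal sketch; note that caret's knnImpute also centers and scales every variable, which is why the yield values later in this exercise are on a standardized scale):

sum(is.na(ChemicalManufacturingProcess))  # missing cells before imputation
sum(is.na(df))                            # should be 0 after imputation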

Prepare data

# Prepare data for modeling
dfx <- subset(df, select = -Yield)  # Predictor variables
dfy <- subset(df, select = Yield)   # Target variable

Split Data

# Split data into training and testing sets

set.seed(123) # Ensure consistent data partitioning
chem_train <- createDataPartition(dfy$Yield, p = 0.80, list = FALSE)

x_train <- dfx[chem_train, ]
x_test <- dfx[-chem_train, ]
y_train <- dfy[chem_train, ]
y_test <- dfy[-chem_train, ]

Train several tree-based models:

# Combine training and target for caret

train_df <- cbind(x_train, Yield = y_train)

# Set up cross-validation control

fit_control <- trainControl(method = "cv", number = 5)

1. Decision Tree

set.seed(123)

model_rpart <- train(Yield ~ ., data = train_df,
                     method = "rpart",
                     trControl = fit_control,
                     tuneLength = 10)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

2. Random Forest

set.seed(123)

model_rf <- train(Yield ~ ., data = train_df,
                  method = "rf",
                  trControl = fit_control,
                  tuneLength = 5)

3. Gradient Boosting Machine (gbm)

set.seed(123)

model_gbm <- train(Yield ~ ., data = train_df,
                   method = "gbm",
                   trControl = fit_control,
                   verbose = FALSE,
                   tuneLength = 5)

# Gather resampling performance
resamp <- resamples(list(
  Decision_Tree = model_rpart,
  Random_Forest = model_rf,
  GBM = model_gbm
))

summary(resamp)
## 
## Call:
## summary.resamples(object = resamp)
## 
## Models: Decision_Tree, Random_Forest, GBM 
## Number of resamples: 5 
## 
## MAE 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Decision_Tree 0.5042686 0.5180304 0.5736425 0.5777800 0.6197479 0.6732105    0
## Random_Forest 0.4155842 0.4321066 0.4504947 0.4902054 0.5432425 0.6095991    0
## GBM           0.4774527 0.4934104 0.4950587 0.5231962 0.5498396 0.6002198    0
## 
## RMSE 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Decision_Tree 0.6867645 0.6876721 0.7092115 0.7539234 0.8191198 0.8668490    0
## Random_Forest 0.5422546 0.5516526 0.6110646 0.6301385 0.6970029 0.7487178    0
## GBM           0.6097555 0.6166548 0.6494356 0.6581330 0.6713063 0.7435130    0
## 
## Rsquared 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Decision_Tree 0.3498574 0.3930898 0.4385399 0.4604911 0.5122564 0.6087121    0
## Random_Forest 0.4983423 0.5977887 0.6385069 0.6320155 0.6796997 0.7457400    0
## GBM           0.4832002 0.5060655 0.6199565 0.5864537 0.6250422 0.6980042    0
bwplot(resamp, metric = "RMSE")

(a) Best model

Which tree-based regression model gives the optimal resampling and test set performance?

# Predict and calculate test set RMSE and R^2

model_list <- list(Decision_Tree = model_rpart, Random_Forest = model_rf,
                   GBM = model_gbm)

test_performance <- sapply(model_list, function(mod) {
  pred <- predict(mod, x_test)
  rmse <- sqrt(mean((y_test - pred)^2))
  r2 <- cor(y_test, pred)^2
  c(RMSE = rmse, R2 = r2)
})

test_performance 
##      Decision_Tree Random_Forest       GBM
## RMSE     0.9959841     0.6586840 0.6086936
## R2       0.1721357     0.6038714 0.6514447

Based on the resampling results, Random Forest has the lowest mean cross-validated RMSE (0.63 vs. 0.66 for GBM), while on the test set GBM has the lowest RMSE (0.61 vs. 0.66) and the highest R^2 (0.65); both clearly outperform the single decision tree.

(b) Predictor importance

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Let's use GBM as the optimal model, since it has the lowest RMSE and the highest R^2 on the test set.

best_model <- model_gbm

# Plot variable importance
varimp <- varImp(best_model, scale = FALSE)
plot(varimp, top = 20)

# Print top 10 predictors
top_preds <- as.data.frame(varimp$importance)
top10 <- head(top_preds[order(-top_preds$Overall), , drop = FALSE], 10)
top10
##                          Overall
## ManufacturingProcess32 128.72336
## ManufacturingProcess09  33.71834
## BiologicalMaterial06    32.08112
## BiologicalMaterial12    30.28647
## ManufacturingProcess17  26.20252
## ManufacturingProcess13  25.65664
## ManufacturingProcess15  21.68391
## BiologicalMaterial03    16.22725
## ManufacturingProcess06  14.92269
## ManufacturingProcess37  13.91883

ManufacturingProcess32 and ManufacturingProcess09 are the most important predictors in the GBM model, and manufacturing process variables dominate the list, taking 7 of the top 10 spots versus 3 for biological material variables.
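
A quick tally of the top 10 makes the split explicit (a sketch using the top10 table computed above):

table(ifelse(grepl("^Manufacturing", rownames(top10)), "Process", "Biological"))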

Comparison with linear and nonlinear models:

# Linear model for reference
set.seed(123)
model_lm <- train(Yield ~ ., data = train_df, method = "lm", trControl = fit_control)

# Nonlinear SVM
set.seed(123)
model_svm <- train(Yield ~ ., data = train_df, method = "svmRadial",
                   trControl = fit_control, tuneLength = 5)

# Get variable importance
vimp_lm <- varImp(model_lm, scale = FALSE)
vimp_svm <- varImp(model_svm, scale = FALSE)

top10_lm <- head(as.data.frame(vimp_lm$importance)[order(-vimp_lm$importance$Overall), , drop = FALSE], 10)
top10_svm <- head(as.data.frame(vimp_svm$importance)[order(-vimp_svm$importance$Overall), , drop = FALSE], 10)

Variable importance for Linear model

top10_lm
##                         Overall
## ManufacturingProcess37 3.215417
## ManufacturingProcess29 3.145689
## ManufacturingProcess45 2.295809
## ManufacturingProcess32 2.091999
## ManufacturingProcess04 2.051229
## BiologicalMaterial05   2.013918
## ManufacturingProcess43 1.950541
## ManufacturingProcess06 1.732567
## ManufacturingProcess28 1.719629
## BiologicalMaterial03   1.602956

Manufacturing process variables also dominate the linear model's top 10, with ManufacturingProcess37 ranked highest.

Variable importance for Non-linear model

top10_svm
##                          Overall
## ManufacturingProcess32 0.3704202
## BiologicalMaterial06   0.3484339
## BiologicalMaterial03   0.3010236
## ManufacturingProcess13 0.2986765
## ManufacturingProcess36 0.2932782
## ManufacturingProcess31 0.2846364
## BiologicalMaterial02   0.2816569
## ManufacturingProcess17 0.2812153
## ManufacturingProcess09 0.2705700
## BiologicalMaterial12   0.2573729

For the nonlinear (SVM) model, ManufacturingProcess32 is again on top, followed by BiologicalMaterial06 and BiologicalMaterial03; the ranking is very similar to the GBM variable importance.
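
The overlap between the rankings can be checked directly (a sketch using the top-10 tables computed above):

intersect(rownames(top10), rownames(top10_lm))   # GBM vs. linear model
intersect(rownames(top10), rownames(top10_svm))  # GBM vs. SVM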

(c) Visualization

Plot the optimal single tree with the distribution of yield in the terminal nodes.

Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

tree <- rpart(Yield ~ ., data=train_df, method="anova")
rpart.plot(tree, type=4, extra=101, box.palette="Blues", main="Decision Tree with Yield Distribution in Terminal Nodes")

Yes, the decision tree illustrates how various combinations of predictor variables impact the yield.

Each branch represents a split on a specific predictor, leading to terminal nodes that display the average (standardized) yield for that group of observations. For instance, if BiologicalMaterial03 is below -0.33 and ManufacturingProcess17 is at least -0.4, the expected standardized yield is around -0.78, based on 59 observations (41% of the training data); the values are on a standardized scale because the pre-processing centered and scaled all variables.

The tree highlights the most significant predictors (those positioned higher up) and shows how their interactions influence the yield. The terminal nodes give yield estimates for different predictor value combinations, revealing insights that aren’t obvious when looking at the predictors individually. Thus, the decision tree offers valuable insights into the relationships between predictors and yield.