Questions 8.1, 8.2, 8.3, and 8.7 from the Kuhn & Johnson book

library(caret)
library(corrplot)
library(Cubist)
library(dplyr)
library(data.table)
library(e1071)
library(earth)
library(fable)
library(forecast)
library(fpp2)
library(fpp3)
library(ggplot2)
library(GGally)
library(gbm)
library(knitr)
library(kernlab)
library(latex2exp) 
library(mlbench)
library(MASS)
library(purrr)
library(party)
library(pls)
library(patchwork)
library(rpart)
library(rpart.plot)
library(RANN) 
library(randomForest)
library(stats)
library(tsibble)
library(tidyr)

——————————————————————————–

(8.1) Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)


# Variable Importance
# Convert importance scores to a data frame
rf_imp_df <- data.frame(Variables = rownames(rfImp1), Importance = rfImp1$Overall)

# Order by importance
rf_imp_df <- rf_imp_df[order(rf_imp_df$Importance, decreasing = TRUE), ]

# Plot
ggplot(rf_imp_df, aes(x = reorder(Variables, Importance), y = Importance)) +
  geom_col(fill = "#228B22") +
  coord_flip() +
  labs(title = "Random Forest Variable Importance Score")

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The uninformative predictors V6-V10 receive importance scores near zero. Because each split is chosen to maximize the reduction in error, the trees rarely select V6-V10, and permuting them barely changes out-of-bag accuracy, so the forest concentrates its importance on the informative predictors V1-V5.
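As a quick numerical check, the raw scores for the uninformative predictors can be printed directly from the rfImp1 object created above (a minimal sketch):

# Importance scores for the uninformative predictors; expected to be near zero
# (or slightly negative) compared with V1-V5
rfImp1[paste0("V", 6:10), , drop = FALSE]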


(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

# Variable Importance
# Convert importance scores to a data frame
rf_imp_df2 <- data.frame(Variables = rownames(rfImp2), Importance = rfImp2$Overall)

# Order by importance
rf_imp_df2 <- rf_imp_df2[order(rf_imp_df2$Importance, decreasing = TRUE), ]

# Plot
ggplot(rf_imp_df2, aes(x = reorder(Variables, Importance), y = Importance)) +
  geom_col(fill = "#228B22") +
  coord_flip() +
  labs(title = "Random Forest 2 Variable Importance Score")

Since the new variable is highly correlated with V1 (r ≈ 0.946, as shown above), V1's importance dropped from 8.56 to 5.16, and V1 now sits just above its duplicate in the ranking. As a result, V4 and V2 moved up from 2nd and 3rd to 1st and 2nd, respectively. Interestingly, the duplicate also affected V4: it rose to the top of the ranking, yet its score fell from 7.73 in the first forest to 5.59 in the second. Adding a correlated copy of a predictor dilutes the importance shared among the correlated variables rather than adding information.
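To address the second part of the question, here is a sketch that adds another predictor correlated with V1 (duplicate2 is a name I introduce) and refits; the expectation, following the discussion above, is that V1's importance is diluted even further across V1, duplicate1, and duplicate2. It works on a copy so the later fits on simulated are unaffected:

# Work on a copy so the subsequent cforest/cubist fits on `simulated` are unchanged
set.seed(123)  # arbitrary seed for this illustration
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1
cor(sim_dup2$duplicate2, sim_dup2$V1)

model2b <- randomForest(y ~ ., data = sim_dup2, importance = TRUE, ntree = 1000)
rfImp2b <- varImp(model2b, scale = FALSE)

# Expectation: V1's importance is split across V1, duplicate1, and duplicate2
rfImp2b[order(rfImp2b$Overall, decreasing = TRUE), , drop = FALSE]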


(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# conditional inference trees  (Pg. 182)
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varimp(model3, conditional = TRUE)

# Convert to data frame and clean up names
rf_imp_df3 <- data.frame(Variables = names(rfImp3), Importance = as.numeric(rfImp3))

# Plot
ggplot(rf_imp_df3, aes(x = reorder(Variables, Importance), y = Importance)) +
  geom_col(fill = "#228B22") +
  coord_flip() +
  labs(title = "Random Forest 3 Variable Importance Score")

Yes, the pattern largely mirrors that of the traditional random forest: the informative predictors dominate and the uninformative V6-V10 remain near zero, although the conditional scores are on a smaller scale.
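For a direct comparison, the traditional (unconditional) and conditional measures can be placed side by side for the same cforest fit (a sketch reusing model3 and the conditional rfImp3 computed above):

# Traditional vs. conditional importance for the same conditional inference forest
imp_uncond <- varimp(model3, conditional = FALSE)

data.frame(Variable      = names(imp_uncond),
           Unconditional = round(imp_uncond, 3),
           Conditional   = round(rfImp3[names(imp_uncond)], 3))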


(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Cubist model: predictors vs. response
model_cubist <- cubist(x = simulated[, setdiff(colnames(simulated), "y")], y = simulated$y)

cubist_imp <- varImp(model_cubist, scale = FALSE)

cubist_imp %>%
  mutate(var = rownames(cubist_imp)) %>%
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = '#228B22') +
  labs(title = 'Cubist Variable Importance Score', x = 'Importance', y = NULL)

As expected, Cubist also concentrates its importance on the informative predictors and drops V6-V10. Cubist fits a linear model at each rule's terminal node, and its importance measure counts how often a variable appears in the rule conditions and in those linear models, aggregated across all rules (and committees) and then scaled. In this case only four predictors received non-trivial scores, with roughly equal importance among them, so the same pattern holds: the uninformative predictors are essentially ignored.
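Part (d) also mentions boosted trees; a minimal gbm sketch on the same simulated data (the tuning values here are illustrative, not tuned) to confirm whether the pattern repeats:

# Boosted trees on the simulated data (illustrative settings, not tuned)
set.seed(200)
gbm_sim <- gbm(y ~ ., data = simulated,
               distribution = "gaussian",
               n.trees = 1000,
               interaction.depth = 3,
               shrinkage = 0.01,
               verbose = FALSE)

# Relative influence; the uninformative V6-V10 are expected to sit near the bottom
summary(gbm_sim, plotit = FALSE)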

——————————————————————————–

(8.2) Use a simulation to show tree bias with different granularities.
library(party)
library(ggplot2)
library(dplyr)

####################################
set.seed(12345)

simulated2 <- mlbench.friedman2(150, sd = 2)
simulated2 <- cbind(simulated2$x, simulated2$y)
simulated2 <- as.data.frame(simulated2)
colnames(simulated2)[ncol(simulated2)] <- "y"

cor(simulated2)
##             V1          V2          V3          V4           y
## V1  1.00000000 -0.00444964  0.09962107 -0.17208090  0.08398245
## V2 -0.00444964  1.00000000 -0.11637790 -0.05896942  0.57424051
## V3  0.09962107 -0.11637790  1.00000000 -0.01241795  0.65472592
## V4 -0.17208090 -0.05896942 -0.01241795  1.00000000 -0.08072369
## y   0.08398245  0.57424051  0.65472592 -0.08072369  1.00000000
# Fit the model
model4 <- cforest(y ~ ., data = simulated2)

# Get variable importance
rfImp4 <- varimp(model4, conditional = FALSE) %>% as.data.frame()

# VIS Plot
rfImp4 %>%
  rename(Overall = '.') %>%
  mutate(var = rownames(rfImp4)) %>%
  ggplot(aes(Overall, reorder(var, Overall))) +
  geom_col(fill = '#228B22') +
  labs(title = 'Newly Simulated Variable Importance', x = 'Importance', y = NULL)

####################################
# Random
V1 <- rnorm(200, 2, 10)
V2 <- V1 + rnorm(200, 0, 2)
V3 <- V2 * runif(200, 0.75, 1.25) 
y <- V1 + V2
random_df <- data.frame(V1, V2, V3, y)

cor(random_df)
##           V1        V2        V3         y
## V1 1.0000000 0.9831294 0.9757395 0.9956457
## V2 0.9831294 1.0000000 0.9902429 0.9958993
## V3 0.9757395 0.9902429 1.0000000 0.9872716
## y  0.9956457 0.9958993 0.9872716 1.0000000
# Fit the model
controls <- cforest_unbiased(mtry = 2, ntree = 500)
model5 <- cforest(y ~ ., data = random_df, controls = controls)

# Get variable importance
rfImp5 <- varimp(model5, conditional = FALSE) %>% as.data.frame()

# Rename column 
rfImp5 <- rfImp5 %>%
  rename(Importance = ".") %>%
  mutate(Variable = rownames(rfImp5))

# VIS Plot
ggplot(rfImp5, aes(x = Importance, y = reorder(Variable, Importance))) + 
  geom_col(fill = '#228B22') + 
  labs(title = 'Newly Simulated Variable Importance 2') +
  theme_minimal()
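To show the granularity effect more directly, here is a sketch of my own construction in which neither predictor carries information about y, but one has many distinct values and the other only a few; a CART-style tree is expected to choose the finer-grained predictor for its first split far more often:

# Granularity bias demo: y is pure noise, X_fine has ~100 distinct values,
# X_coarse has only 4. Record which variable wins the root split.
set.seed(54321)
first_split <- replicate(200, {
  df <- data.frame(y        = rnorm(100),
                   X_fine   = rnorm(100),
                   X_coarse = sample(1:4, 100, TRUE))
  fit <- rpart(y ~ ., data = df,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])
})

# Proportion of simulations in which each predictor was chosen at the root
table(first_split) / length(first_split)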

——————————————————————————–

(8.3) In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:


(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction and learning rate both have strong effects on how importance is distributed. With a bagging fraction of 0.1, each tree trains on only 10% of the data, so different trees see very different subsets and split on different predictors, which spreads the importance across many variables. With a bagging fraction of 0.9, nearly every tree sees almost the same data and keeps selecting the same strong predictors. The learning rate controls how much each tree contributes to the final model: at 0.9 each tree's contribution is large, so the first few trees, built on the dominant predictors, account for most of the fit, while at 0.1 many small contributions accumulate and more predictors have a chance to matter. The right-hand model (0.9/0.9) therefore concentrates its importance on a few predictors such as NumCarbon, MolWeight, and the SurfaceArea variables, whereas the left-hand model (0.1/0.1) spreads importance more evenly.
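As a sketch of how the two extremes in Fig. 8.24 could be reproduced (this assumes the solubility data shipped with the AppliedPredictiveModeling package; the n.trees and interaction.depth values are illustrative choices of mine, not the book's):

library(AppliedPredictiveModeling)
data(solubility)

sol_train <- data.frame(solTrainXtrans, Solubility = solTrainY)

set.seed(100)
gbm_low  <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)

# Compare how concentrated the relative influence is in each fit
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)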


(b) Which model do you think would be more predictive of other samples?

The left model (bagging fraction and learning rate of 0.1) would likely be more predictive of new samples. The lower bagging fraction exposes each tree to a different random subset of the data, and the small learning rate means no single tree dominates, so the ensemble averages over many diverse, weakly fitted trees. This reduces overfitting and lets more predictors contribute, which generally gives better generalization than the aggressive 0.9/0.9 settings.


(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Interaction depth is the parameter that controls the maximum depth of each tree: the deeper the trees, the more predictors can enter the splits and the more variable interactions the model can represent. Increasing it would pull additional predictors into both models, spreading importance over more variables. The dominant predictors would stay at the top, but the gap between them and the rest would shrink, so the slope of the variable-importance profile would likely flatten for both the 0.1/0.1 and the 0.9/0.9 models. The effect would be smaller on the right-hand model, because the high bagging fraction and learning rate still push every tree toward the same few strong predictors.
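A quick way to check this claim, building on the sol_train data frame from the sketch above (again with illustrative settings):

# Refit the low-parameter model with a deeper interaction depth and see how
# the relative influence spreads across predictors
set.seed(100)
gbm_deep <- gbm(Solubility ~ ., data = sol_train, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.1, bag.fraction = 0.1)

ri <- summary(gbm_deep, plotit = FALSE)
sum(ri$rel.inf > 0)   # how many predictors receive any influence at all
head(ri, 10)          # how concentrated the top of the profile is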

——————————————————————————–

(8.7) Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Load chemical data

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

################################################################################
# Impute missing values
CMP_impute <- ChemicalManufacturingProcess %>%
  preProcess("knnImpute")

CMP_complete <- predict(CMP_impute, ChemicalManufacturingProcess)

################################################################################
# Remove near zero Variance predictors
CMP_complete <- CMP_complete[, -nearZeroVar(CMP_complete)]

################################################################################
# Train Test Split
set.seed(12345)

index <- createDataPartition(CMP_complete$Yield, p = .8, list = FALSE)

# Train
train_predictors <- CMP_complete[index, -1]  
train_yield <- CMP_complete[index, 1]

# Test
test_predictors <- CMP_complete[-index, -1]
test_yield <- CMP_complete[-index, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(12345)

rf_model <- randomForest(x = train_predictors,
                         y = train_yield,
                         importance = TRUE,
                         ntree = 500)
 
plot(rf_model)

################################################################################
# Variable Importance
varImpPlot(rf_model, main = "Random Forest Variable Importance")

################################################################################
# Predict on Test Set
rf_predictions <- predict(rf_model, test_predictors)

################################################################################
# Evaluate Performance
rf_results <- postResample(pred = rf_predictions, obs = test_yield)

print(rf_results)
##      RMSE  Rsquared       MAE 
## 0.5030486 0.7814039 0.4239256
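The question also asks about resampling performance, which the test-set numbers above do not cover; here is a sketch using caret::train with 10-fold cross-validation (the tune length and fold count are my choices, not prescribed):

# Resampling estimate for the random forest via 10-fold cross-validation
set.seed(12345)
rf_cv <- train(x = train_predictors,
               y = train_yield,
               method = "rf",
               tuneLength = 5,
               trControl = trainControl(method = "cv", number = 10))

# Cross-validated RMSE / R^2 / MAE at the selected mtry
rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry, ]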

Cubist

set.seed(12345)

cubist_model <- cubist(x = train_predictors, 
                       y = train_yield, 
                       # Number of committee models (boosting-style ensemble members)
                       committees = 10)

################################################################################
# Predict on Test Set
cubist_predictions <- predict(cubist_model, test_predictors)

################################################################################
# Evaluate Performance
cubist_results <- postResample(pred = cubist_predictions, obs = test_yield)

print(cubist_results)
##      RMSE  Rsquared       MAE 
## 0.7271022 0.5338601 0.5225497
################################################################################
# Variable Importance 
cubist_imp <- varImp(cubist_model, scale = FALSE)

# Transform data for easier plotting
cubist_imp_df <- data.frame(Variable = rownames(cubist_imp), Importance = cubist_imp$Overall)

ggplot(cubist_imp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = '#228B22') +
  coord_flip() +
  labs(title = "Cubist Variable Importance")

Regression Tree (CART)

set.seed(12345)

rpart_model <- rpart(train_yield ~ .,          
                     data = train_predictors,  
                     method = "anova",         
                     control = rpart.control(minsplit = 20, cp = 0.01))

################################################################################
# Plot 
rpart.plot(rpart_model, main = "Regression Tree", type = 3, extra = 1)

################################################################################
# Variable Importance
rpart_imp <- varImp(rpart_model, scale = FALSE)

# Convert to data frame for easier plotting
rpart_imp_df <- data.frame(Variable = rownames(rpart_imp), Importance = rpart_imp$Overall)


ggplot(rpart_imp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = '#228B22') +
  coord_flip() +
  labs(title = "Regression Tree Variable Importance")

################################################################################
# Predict on Test Set
rpart_predictions <- predict(rpart_model, newdata = test_predictors)

################################################################################
# Evaluate Performance
rpart_results <- postResample(pred = rpart_predictions, obs = test_yield)

print(rpart_results)
##      RMSE  Rsquared       MAE 
## 0.7944217 0.4296448 0.6061872

Gradient Boosted Trees

set.seed(12345)

gbm_model <- gbm(
  formula = train_yield ~ .,    # Target variable ~ predictors
  data = data.frame(train_yield, train_predictors),  # Combine target and predictors
  distribution = "gaussian",    # Regression model (for numeric outcomes)
  n.trees = 500,                # Number of boosting iterations 
  interaction.depth = 4,        # Maximum depth 
  shrinkage = 0.01,             # Learning rate
  cv.folds = 5,                 # Cross-validation folds
  n.minobsinnode = 10           # Minimum number of observations in each leaf node
)

################################################################################
# Variable Importance
summary(gbm_model, cBars = 10, main = "Gradient Boosted Tree Variable Importance")

##                                           var     rel.inf
## ManufacturingProcess32 ManufacturingProcess32 26.88852835
## ManufacturingProcess13 ManufacturingProcess13  7.36140368
## ManufacturingProcess17 ManufacturingProcess17  6.58807977
## ManufacturingProcess09 ManufacturingProcess09  4.73966960
## ManufacturingProcess06 ManufacturingProcess06  4.68255796
## BiologicalMaterial12     BiologicalMaterial12  4.37093010
## BiologicalMaterial11     BiologicalMaterial11  3.33597023
## BiologicalMaterial03     BiologicalMaterial03  3.19574098
## ManufacturingProcess31 ManufacturingProcess31  2.29546401
## BiologicalMaterial05     BiologicalMaterial05  1.85146803
## BiologicalMaterial09     BiologicalMaterial09  1.81915688
## ManufacturingProcess20 ManufacturingProcess20  1.69770405
## ManufacturingProcess05 ManufacturingProcess05  1.65776671
## BiologicalMaterial06     BiologicalMaterial06  1.46363313
## BiologicalMaterial04     BiologicalMaterial04  1.38528019
## BiologicalMaterial08     BiologicalMaterial08  1.37260374
## ManufacturingProcess04 ManufacturingProcess04  1.33963576
## ManufacturingProcess01 ManufacturingProcess01  1.31274887
## ManufacturingProcess27 ManufacturingProcess27  1.26442216
## ManufacturingProcess19 ManufacturingProcess19  1.18391546
## ManufacturingProcess25 ManufacturingProcess25  1.08642647
## ManufacturingProcess39 ManufacturingProcess39  1.05584652
## ManufacturingProcess29 ManufacturingProcess29  0.94160760
## ManufacturingProcess18 ManufacturingProcess18  0.92274638
## BiologicalMaterial02     BiologicalMaterial02  0.91959488
## ManufacturingProcess15 ManufacturingProcess15  0.90234025
## ManufacturingProcess37 ManufacturingProcess37  0.88977417
## ManufacturingProcess11 ManufacturingProcess11  0.87523234
## ManufacturingProcess24 ManufacturingProcess24  0.87387221
## ManufacturingProcess28 ManufacturingProcess28  0.83042229
## ManufacturingProcess43 ManufacturingProcess43  0.78584359
## ManufacturingProcess30 ManufacturingProcess30  0.78557415
## ManufacturingProcess34 ManufacturingProcess34  0.77490011
## ManufacturingProcess26 ManufacturingProcess26  0.74368945
## ManufacturingProcess33 ManufacturingProcess33  0.63775581
## ManufacturingProcess21 ManufacturingProcess21  0.62688217
## ManufacturingProcess14 ManufacturingProcess14  0.61900641
## ManufacturingProcess02 ManufacturingProcess02  0.60488709
## ManufacturingProcess22 ManufacturingProcess22  0.53547902
## BiologicalMaterial10     BiologicalMaterial10  0.53437187
## BiologicalMaterial01     BiologicalMaterial01  0.53057577
## ManufacturingProcess35 ManufacturingProcess35  0.49067058
## ManufacturingProcess36 ManufacturingProcess36  0.48798781
## ManufacturingProcess03 ManufacturingProcess03  0.47316357
## ManufacturingProcess16 ManufacturingProcess16  0.38967425
## ManufacturingProcess23 ManufacturingProcess23  0.36786401
## ManufacturingProcess45 ManufacturingProcess45  0.35306249
## ManufacturingProcess07 ManufacturingProcess07  0.30702903
## ManufacturingProcess10 ManufacturingProcess10  0.22333866
## ManufacturingProcess44 ManufacturingProcess44  0.17306208
## ManufacturingProcess42 ManufacturingProcess42  0.16807405
## ManufacturingProcess08 ManufacturingProcess08  0.11183145
## ManufacturingProcess38 ManufacturingProcess38  0.10941490
## ManufacturingProcess40 ManufacturingProcess40  0.06131891
## ManufacturingProcess12 ManufacturingProcess12  0.00000000
## ManufacturingProcess41 ManufacturingProcess41  0.00000000
################################################################################
# Predict on Test Set
gbm_predictions <- predict(gbm_model, newdata = test_predictors, n.trees = gbm_model$n.trees)

################################################################################
# Evaluate Performance
gbm_results <- postResample(pred = gbm_predictions, obs = test_yield)

print(gbm_results)
##      RMSE  Rsquared       MAE 
## 0.5037441 0.7513914 0.3915591
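Because the boosted model was fit with cv.folds = 5, the cross-validated optimum number of trees could be used for prediction instead of all 500; a small sketch:

# Select the iteration that minimizes the cross-validated error
best_iter <- gbm.perf(gbm_model, method = "cv", plot.it = FALSE)

gbm_cv_predictions <- predict(gbm_model, newdata = test_predictors, n.trees = best_iter)
postResample(pred = gbm_cv_predictions, obs = test_yield)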

See results

Results <- data.frame(
  Model = c("Random Forest", "Cubist", "Regression Tree (CART)", "Gradient Boosted Trees"),
  RMSE = c(rf_results["RMSE"], cubist_results["RMSE"], rpart_results["RMSE"], gbm_results["RMSE"]),
  MAE = c(rf_results["MAE"], cubist_results["MAE"], rpart_results["MAE"], gbm_results["MAE"]),
  Rsquared = c(rf_results["Rsquared"], cubist_results["Rsquared"],  rpart_results["Rsquared"],  gbm_results["Rsquared"])
)

print(Results)
##                    Model      RMSE       MAE  Rsquared
## 1          Random Forest 0.5030486 0.4239256 0.7814039
## 2                 Cubist 0.7271022 0.5225497 0.5338601
## 3 Regression Tree (CART) 0.7944217 0.6061872 0.4296448
## 4 Gradient Boosted Trees 0.5037441 0.3915591 0.7513914

The optimal model is the Random Forest, followed closely by the Gradient Boosted Trees. The Random Forest has the lowest RMSE (0.5030486) and the highest R^2 (0.7814039), slightly ahead of boosting's RMSE of 0.5037441 and R^2 of 0.7513914, although boosting does have the lowest MAE (0.3915591).


(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
# Variable Importance
varImpPlot(rf_model, main = "Random Forest Variable Importance")

The most important predictors are MP32 (ManufacturingProcess32), MP13, BM03 (BiologicalMaterial03), BM11, MP17, MP09, BM12, BM06, and so on. This closely matches the variable importance ranking from the optimal SVM model in the prior assignment. The list is skewed toward the manufacturing process variables, although 8 of the top 18 predictors are biological material variables, albeit with lower importance.
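To list the top 10 explicitly for this comparison, a sketch pulling the permutation importance (%IncMSE) straight from the fitted rf_model:

# Top 10 predictors by permutation importance (%IncMSE) for the random forest
rf_imp_matrix <- importance(rf_model)
top10 <- head(rf_imp_matrix[order(rf_imp_matrix[, "%IncMSE"], decreasing = TRUE), ], 10)
round(top10, 2)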


(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Prune to find optimal tree

library(rattle)
library(rpart.plot)

# Find optimal complexity parameter 
best_cp <- rpart_model$cptable[which.min(rpart_model$cptable[,"xerror"]), "CP"]

# Prune the tree
optimal_tree <- prune(rpart_model, cp = best_cp)

################################################################################
# Plot optimal tree 
fancyRpartPlot(optimal_tree, 
               sub = "Optimal Regression Tree",
               palettes = c("Greens"), 
               caption = NULL)

# Save fancyRpartPlot to PNG
png("optimal_tree_fancy.png", width = 1200, height = 900)

fancyRpartPlot(optimal_tree, main = "Optimal Regression Tree with Yield Distribution")

dev.off()
## png 
##   2

The root split is on MP32 (ManufacturingProcess32), followed by BM11 and MP17 at the second level and BM05, MP25, and MP09 at the third level, with less important predictors further down. The tree visualization shows the threshold at which each predictor splits (whether an observation falls above or below a specific value) along with the mean yield and number of observations in each terminal node, giving a concrete picture of how the process and biological variables combine to drive yield.
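To show the yield distribution within the terminal nodes explicitly (the second part of the question), a sketch using the terminal-node assignments that rpart stores in the fitted object's where component (assuming the pruned tree retains them):

# Distribution of (pre-processed) yield within each terminal node of the pruned tree
node_assignment <- factor(optimal_tree$where)

boxplot(train_yield ~ node_assignment,
        xlab = "Terminal node", ylab = "Yield (centered and scaled)",
        main = "Yield Distribution by Terminal Node",
        col = "#228B22")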