library(tidyverse)
library(ggrepel)
library(fabletools)
library(corrplot)
library(GGally)
library(patchwork)
library(caret)
library(pls)
library(impute)
library(mlbench)
library(randomForest)
library(party)
library(rpart)
library(rpart.plot)
library(Cubist)
library(gbm)
library(RWeka)
library(ipred)
library(AppliedPredictiveModeling)
theme_set(theme_bw())

Problem 8.1

Recreate the simulated data from Exercise 7.2:

set.seed(200)
simulated = mlbench.friedman1(200, sd = 1) 
simulated = cbind(simulated$x, simulated$y) 
simulated = as.data.frame(simulated) 
colnames(simulated)[ncol(simulated)] = "y" 
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 = varImp(model1, scale = FALSE) 
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1 |> arrange(desc(Overall))

Based on the output above, the random forest model does not make significant use of the uninformative predictors. The most important of them (V6) scores only about 20% as high as the least important informative predictor (V3).
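
As a quick check on that ratio, the two scores can be read straight from the varImp output above (a minimal sketch; caret's varImp on a randomForest fit returns a data frame with an Overall column):

# ratio of the strongest uninformative predictor to the weakest informative one
rfImp1["V6", "Overall"] / rfImp1["V3", "Overall"]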

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 = simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 = varImp(model2, scale = FALSE)
rfImp2 |> arrange(desc(Overall))

Based on the output above, the importance score of V1 dropped by almost 50% in the new random forest model. Because duplicate1 carries nearly the same information as V1, the trees split on either variable interchangeably, so the importance that formerly accrued to V1 alone is now shared between the two.

simulated$duplicate2 = simulated$V1 - rnorm(200)*0.05
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9851041
model3 = randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 = varImp(model3, scale = FALSE)
rfImp3 |> arrange(desc(Overall))

After adding a second predictor highly correlated with V1, the importance score of V1 barely changed, decreasing by only about 0.11 relative to the model with a single correlated predictor.
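
To see the dilution in one place, the V1 scores from the three fits can be collected side by side (a small sketch using the importance objects computed above):

# V1 importance with zero, one, and two correlated duplicates
c(no_dup = rfImp1["V1", "Overall"],
  one_dup = rfImp2["V1", "Overall"],
  two_dup = rfImp3["V1", "Overall"])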

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
cforest_model1 = cforest(y ~ ., data = simulated |> select(-duplicate1, -duplicate2),
                        controls = cforest_control(ntree = 1000))
rfImp_cforest1 = varImp(cforest_model1)
rfImp_cforest1 |> arrange(desc(Overall))
cforest_model2 = cforest(y ~ ., data = simulated |> select(-duplicate1),
                        controls = cforest_control(ntree = 1000))
rfImp_cforest2 = varImp(cforest_model2)
rfImp_cforest2 |> arrange(desc(Overall))
cforest_model3 = cforest(y ~ ., data = simulated,
                        controls = cforest_control(ntree = 1000))
rfImp_cforest3 = varImp(cforest_model3)
rfImp_cforest3 |> arrange(desc(Overall))

The three cforest models above use none, one, and both of the correlated predictors, respectively. In general they follow a pattern similar to the randomForest fits: adding one correlated predictor cuts the importance of V1 by about 50%, and adding a second decreases it further. With cforest, however, the second duplicate removes roughly another 50%, a much larger drop than the randomForest method showed.
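
The prompt also mentions the conditional argument of varimp, which the fits above leave at its default. A minimal sketch of requesting the Strobl et al. (2007) conditional measure on the third model (conditional permutation is considerably slower, and party's documentation suggests growing the forest with cforest_unbiased controls when using it):

# conditional permutation importance, which adjusts for correlated predictors
sort(varimp(cforest_model3, conditional = TRUE), decreasing = TRUE)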

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
cubist_model1 = cubist(x = simulated |> select(-duplicate1, -duplicate2, -y),
                       y = simulated$y, committees = 100)
rfImp_cubist1 = varImp(cubist_model1)
rfImp_cubist1 |> arrange(desc(Overall))
cubist_model2 = cubist(x = simulated |> select(-duplicate1, -y),
                       y = simulated$y, committees = 100)
rfImp_cubist2 = varImp(cubist_model2)
rfImp_cubist2 |> arrange(desc(Overall))
cubist_model3 = cubist(x = simulated |> select(-y),
                       y = simulated$y, committees = 100)
rfImp_cubist3 = varImp(cubist_model3)
rfImp_cubist3 |> arrange(desc(Overall))

With the Cubist method, the importance of V1 is highest (71.5) when neither correlated predictor is present. It decreases by roughly 40% once the first highly correlated predictor is added, similar to the two random forest methods. Unlike the random forest models, however, the importance of V1 actually increases once the second highly correlated predictor is added.
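
Cubist itself reports how often each predictor appears in rule conditions versus in the linear models, which is what varImp averages for this model type; a hedged way to inspect the raw usage table on the third fit:

# per-predictor usage (%) in rule conditions and in model equations
cubist_model3$usage |> arrange(desc(Conditions), desc(Model))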

boosted_model1 = gbm(y ~ ., data = simulated |> select(-duplicate1, -duplicate2),
                     distribution = "gaussian")
summary(boosted_model1)
boosted_model2 = gbm(y ~ ., data = simulated |> select(-duplicate1),
                     distribution = "gaussian")
summary(boosted_model2)
boosted_model3 = gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(boosted_model3)

With gradient boosting, the relative influence of V1 is highest (about 26.6) when neither correlated predictor is present. It decreases by roughly 28% once the first highly correlated predictor is added, and by an additional 61% once the second is added. Although the magnitudes of the drops differ from the random forests, the general behavior when adding correlated predictors is similar.
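
The relative influence values quoted here come from gbm's summary; with plotit = FALSE it returns them as a data frame, so V1 can be pulled from each fit directly (a small sketch using the models above):

# relative influence of V1 with zero, one, and two correlated duplicates
c(no_dup = summary(boosted_model1, plotit = FALSE)["V1", "rel.inf"],
  one_dup = summary(boosted_model2, plotit = FALSE)["V1", "rel.inf"],
  two_dup = summary(boosted_model3, plotit = FALSE)["V1", "rel.inf"])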

Problem 8.2

Use a simulation to show tree bias with different granularities.

For the simulation, a data frame is built with two predictors and a response: x1 is a permutation of the integers 1 through 500 (high granularity), x2 is a categorical variable with three levels (low granularity), and y is pure noise, independent of both. The simulation repeats 1000 times, fitting a new tree to each data frame and recording the variable used for the first split.

set.seed(3981)
first_split = character(1000)  # name of the variable used for each tree's first split

for (i in 1:1000) {
  simulate2 = data.frame(x1 = sample(1:500, size = 500),
                         x2 = sample(c("A", "B", "C"), size = 500, replace = TRUE),
                         y = rnorm(500, mean = 50, sd = 5))

  simulate2_tree = rpart(y ~ ., data = simulate2, 
             control = rpart.control(cp = 0, minsplit = 2, maxdepth = 5))
  first_split[i] = as.character(simulate2_tree$frame$var[1])
}

as.data.frame(first_split) |> group_by(first_split) |> count()

Looking at the results above, all 1000 simulations place the first split on x1, the predictor with 500 distinct values. Even though both predictors are independent of the response, the tree consistently prefers the variable that offers more candidate split points over the categorical predictor with only three levels. This illustrates the selection bias toward high-granularity predictors that tree models tend to have.
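
To make the dependence on granularity explicit, the same simulation can be repeated while varying the number of distinct values in x1; this is a sketch with illustrative choices for the granularities and a smaller replicate count:

set.seed(3981)
granularity = c(2, 10, 100, 500)
win_rate = sapply(granularity, function(k) {
  wins = 0
  for (i in 1:200) {
    sim = data.frame(x1 = sample(1:k, size = 500, replace = TRUE),
                     x2 = sample(c("A", "B", "C"), size = 500, replace = TRUE),
                     y = rnorm(500, mean = 50, sd = 5))
    tree = rpart(y ~ ., data = sim,
                 control = rpart.control(cp = 0, minsplit = 2, maxdepth = 5))
    # record whether x1 won the first split in this replicate
    wins = wins + (as.character(tree$frame$var[1]) == "x1")
  }
  wins / 200
})
data.frame(granularity, win_rate)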

Problem 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

**The model on the right focuses on just a few predictors because, with the bagging fraction being high, most of the training set is used for every tree, which reduces randomness and makes it more likely that the same few strong predictors will dominate each fit. Additionally, the high learning rate means each tree contributes heavily to the ensemble, so the greedy selection of the optimal weak learner at each stage concentrates importance on a smaller set of variables.**
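
The effect can be sketched on the simulated data from Exercise 8.1 (not the solubility data behind Fig. 8.24), fitting gbm at the two parameter extremes; n.trees is an arbitrary choice here, and the duplicated predictors are dropped first:

demo_data = simulated |> select(-duplicate1, -duplicate2)
gbm_low = gbm(y ~ ., data = demo_data, distribution = "gaussian",
              n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high = gbm(y ~ ., data = demo_data, distribution = "gaussian",
               n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
# the high-parameter fit should concentrate its relative influence on
# fewer predictors than the low-parameter fit
summary(gbm_low, plotit = FALSE)
summary(gbm_high, plotit = FALSE)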

(b) Which model do you think would be more predictive of other samples?

**I would think that the model on the left with a bagging fraction and learning rate of 0.1 would be more predictive of other samples. This is because the model on the left is less likely to be biased by a few predictors.**

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

**I would expect that increasing interaction depth would cause the slope of the predictor importance to increase. This is because the increased interaction depth allows trees to utilize strong predictors more when making additional splits.**

Problem 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

set.seed(48103)
data(ChemicalManufacturingProcess)
updated_chem = impute.knn(as.matrix(ChemicalManufacturingProcess))$data
updated_chem = as.data.frame(updated_chem)
anyNA(updated_chem)
## [1] FALSE
train_chem_rows = createDataPartition(updated_chem$Yield, p = 0.75, list = FALSE)

train_chem = updated_chem[train_chem_rows,]
test_chem = updated_chem[-train_chem_rows,]

train_chem_x = train_chem |> select(-Yield)
train_chem_y = train_chem$Yield
test_chem_x = test_chem |> select(-Yield)
test_chem_y = test_chem$Yield
(a) Which tree-based regression model gives the optimal resampling and test set performance?

Basic Regression Tree

basic_grid = expand.grid(maxdepth = 1:20)
basic_chem_model = train(train_chem_x, train_chem_y,
                         method = "rpart2",
                         trControl = trainControl(method = "cv", number = 10),
                         tuneGrid = basic_grid)

pred_basic_tree = predict(basic_chem_model, test_chem |> select(-Yield))
basic_tree_values = data.frame(obs = test_chem$Yield, pred = pred_basic_tree)
defaultSummary(basic_tree_values)
##     RMSE Rsquared      MAE 
## 1.506337 0.344549 1.175471

Regression Model Tree

reg_chem_model = train(train_chem_x, train_chem_y,
                       method = "M5",
                       trControl = trainControl(method = "cv", number = 10),
                       control = Weka_control(M = 10))

pred_reg_tree = predict(reg_chem_model, test_chem |> select(-Yield))
reg_tree_values = data.frame(obs = test_chem$Yield, pred = pred_reg_tree)
defaultSummary(reg_tree_values)
##      RMSE  Rsquared       MAE 
## 1.2949062 0.5309172 1.0201474

Rule-Based Model Tree

rule_chem_model = train(train_chem_x, train_chem_y,
                        method = "M5Rules",
                        trControl = trainControl(method = "cv", number = 10),
                        control = Weka_control(M = 10))

pred_rule_tree = predict(rule_chem_model, test_chem |> select(-Yield))
rule_tree_values = data.frame(obs = test_chem$Yield, pred = pred_rule_tree)
defaultSummary(rule_tree_values)
##      RMSE  Rsquared       MAE 
## 1.8398825 0.2158957 1.3576727

Bagged Trees

bag_chem_model = train(train_chem_x, train_chem_y,
                         method = "treebag",
                         trControl = trainControl(method = "cv", number = 10))

pred_bag_tree = predict(bag_chem_model, test_chem |> select(-Yield))
bag_tree_values = data.frame(obs = test_chem$Yield, pred = pred_bag_tree)
defaultSummary(bag_tree_values)
##      RMSE  Rsquared       MAE 
## 1.3441978 0.4455488 0.9682164

Random Forest

rf_grid = expand.grid(mtry = c(2, 4, 8, 16))
rf_chem_model = train(train_chem_x, train_chem_y,
                      method = "rf",
                      trControl = trainControl(method = "cv", number = 10),
                      tuneGrid = rf_grid)

pred_rf_tree = predict(rf_chem_model, test_chem |> select(-Yield))
rf_tree_values = data.frame(obs = test_chem$Yield, pred = pred_rf_tree)
defaultSummary(rf_tree_values)
##      RMSE  Rsquared       MAE 
## 1.2675650 0.5172413 0.9270131

Boosting

gbm_grid = expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = c(5, 10, 20))

boost_chem_model = train(train_chem_x, train_chem_y,
                         method = "gbm",
                         tuneGrid = gbm_grid,
                         verbose = FALSE)

pred_boost_tree = predict(boost_chem_model, test_chem |> select(-Yield))
boost_tree_values = data.frame(obs = test_chem$Yield, pred = pred_boost_tree)
defaultSummary(boost_tree_values)
##      RMSE  Rsquared       MAE 
## 1.2050363 0.5576808 0.8876818

Cubist

cubist_grid = expand.grid(committees = c(1, 10, 20, 50),
                          neighbors = c(0, 1, 5, 9))

cubist_chem_model = train(train_chem_x, train_chem_y,
                         method = "cubist",
                         tuneGrid = cubist_grid,
                         trControl = trainControl(method = "cv"))

pred_cubist_tree = predict(cubist_chem_model, test_chem |> select(-Yield))
cubist_tree_values = data.frame(obs = test_chem$Yield, pred = pred_cubist_tree)
defaultSummary(cubist_tree_values)
##      RMSE  Rsquared       MAE 
## 1.1035163 0.6453897 0.7579079

Based on the results above, the two best-performing models on the test set are the boosted and Cubist models. The Cubist model has the lower RMSE (about 1.10 versus 1.21 for the boosted model), the higher R^2 (about 0.65 versus 0.56), and the lower MAE (about 0.76 versus 0.89). As a result, the Cubist model is considered the optimal tree-based model.
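
caret's resamples() can also put the cross-validated performance of the fits side by side, rather than relying on the test set alone; a hedged sketch (the gbm fit above used caret's default bootstrap resampling, so it is omitted here and would need to be refit with the same 10-fold CV control to be included):

chem_resamples = resamples(list(rpart2 = basic_chem_model,
                                M5 = reg_chem_model,
                                M5Rules = rule_chem_model,
                                treebag = bag_chem_model,
                                rf = rf_chem_model,
                                cubist = cubist_chem_model))
summary(chem_resamples)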

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cubist_chem_model), top = 10,
     main = "Top 10 Most Important Predictors in Cubist Model")

Looking at the plot above, 6 of the top 10 predictors by importance are process variables, with ManufacturingProcess32, 17, and 01 in the top three. Comparing against the importance rankings from the optimal linear (Ridge Regression) and nonlinear (SVM) models of Exercises 6.3 and 7.5: all three models place ManufacturingProcess32 within their top two most important predictors, and all three include ManufacturingProcess17, ManufacturingProcess09, and ManufacturingProcess13 in their top 10. They also share BiologicalMaterial06 and BiologicalMaterial12 in the top 10. The only other predictor shared with the Cubist model is BiologicalMaterial02, which appears in the top 10 for the linear model only; the remaining three predictors in the Cubist top 10 are unique to it.
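
The overlap can also be computed programmatically; a minimal sketch, where ridge_chem_model and svm_chem_model are hypothetical names for the model objects fit in Exercises 6.3 and 7.5:

# top-10 predictor names from a caret fit, by Overall importance
top10 = function(model) {
  imp = varImp(model)$importance
  rownames(imp)[order(-imp$Overall)][1:10]
}
intersect(top10(cubist_chem_model), top10(ridge_chem_model))  # hypothetical object
intersect(top10(cubist_chem_model), top10(svm_chem_model))    # hypothetical object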

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
basic_chem_model$bestTune$maxdepth
## [1] 8
optimal_basic = rpart(Yield ~ ., data = train_chem,
                      control = rpart.control(maxdepth = basic_chem_model$bestTune$maxdepth))
pred_optimal_basic = predict(optimal_basic, test_chem |> select(-Yield))
optimal_basic_values = data.frame(obs = test_chem$Yield, pred = pred_optimal_basic)
defaultSummary(optimal_basic_values)
##     RMSE Rsquared      MAE 
## 1.506337 0.344549 1.175471
rpart.plot(optimal_basic)

This view helps show which predictors matter most in the model, especially at the nodes near the top of the tree. For example, the first split is on ManufacturingProcess32, suggesting it is a very important predictor; this agrees with the three earlier models for this dataset, where ManufacturingProcess32 ranked within the top two for importance. The view also reveals which split values are critical in determining yield. For ManufacturingProcess19, for example, the split value is 6042, suggesting this is a key threshold to aim for when modifying the process.
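
As a complement to the plot, rpart.plot's rpart.rules() prints the terminal-node rules as text, which makes split thresholds such as the ManufacturingProcess32 and ManufacturingProcess19 cutoffs easier to read off (a minimal sketch):

rpart.rules(optimal_basic, roundint = FALSE)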