# Packages used in this write-up (assumed installed): mlbench, randomForest, caret,
# dplyr, plus party, gbm, Cubist, partykit, caTools, AppliedPredictiveModeling later on
library(mlbench); library(randomForest); library(caret); library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative
predictors (V6 – V10)?
# simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
# cor(simulated$duplicate1, simulated$V1)
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Use the cforest function in the party package to fit
a random forest model using conditional inference trees. The party
package function varimp can calculate predictor importance.
The conditional argument of that function toggles between the
traditional importance measure and the modified version described in
Strobl et al. (2007). Do these importances show the same pattern as the
traditional random forest model?
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
rfImp1 %>%
arrange(desc(Overall))
## Overall
## V1 8.732235404
## V4 7.615118809
## V2 6.415369387
## V5 2.023524577
## V3 0.763591825
## V6 0.165111172
## V7 -0.005961659
## V10 -0.074944788
## V9 -0.095292651
## V8 -0.166362581
The output above shows that the uninformative predictors (V6–V10) contribute very little to the random forest model; in fact, V7 through V10 have slightly negative importance scores.
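As a quick visual check of the same scores (optional, not part of the exercise), randomForest's built-in varImpPlot can be applied to the fitted model:
# permutation-based (%IncMSE) and node-purity importance side by side
varImpPlot(model1, sort = TRUE, main = "Variable importance, model1")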
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9497025
The output above shows that duplicate1 and V1 are highly correlated (about 0.95). Now let's fit another random forest model and look at the variable importance scores again with the new duplicate1 variable included.
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 %>%
arrange(desc(Overall))
## Overall
## V4 6.86363294
## V2 6.05937899
## V1 6.00709780
## duplicate1 4.43323167
## V5 2.19939891
## V3 0.58465293
## V6 0.10898039
## V10 0.09999339
## V9 0.06123662
## V7 0.06104207
## V8 -0.04059204
Based on the output above, the importance ranking of V1 dropped from 1st place to 3rd, with duplicate1 coming right behind it in 4th place. Now let's add another predictor that is also highly correlated with V1.
set.seed(100)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .15
cor(simulated$duplicate2, simulated$V1)
## [1] 0.8998308
The output above shows us that duplicate2 and
V1 have a correlation of 0.9, which is very high. Let’s fit
a random forest model to the modified dataset.
model3 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 %>%
arrange(desc(Overall))
## Overall
## V4 7.254235714
## V2 6.584371908
## V1 5.760644106
## duplicate1 3.356974030
## V5 2.194899926
## duplicate2 1.830541809
## V3 0.490851860
## V6 0.171059467
## V10 0.013491410
## V8 -0.001042333
## V7 -0.047423191
## V9 -0.100834810
The output above shows that V1 is still in 3rd place and duplicate1 is still in 4th, with duplicate2 two places behind duplicate1. duplicate2 has roughly half the importance of duplicate1, most likely because duplicate1 is slightly more strongly correlated with V1 than duplicate2 is.
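To make the dilution effect easier to see, a small helper table (built only from the objects already computed above) tracks how V1's importance changes as correlated copies are added:
# V1's importance under 0, 1, and 2 highly correlated duplicates
data.frame(
  duplicates_added = c(0, 1, 2),
  V1_importance = c(rfImp1["V1", "Overall"],
                    rfImp2["V1", "Overall"],
                    rfImp3["V1", "Overall"])
)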
cforestModel <- cforest(
y ~ .,
data = simulated
)
cforestImpTrad <- varimp(cforestModel, conditional = FALSE) %>% sort(decreasing = TRUE)
cforestImpStrobl <- varimp(cforestModel, conditional = TRUE) %>% sort(decreasing = TRUE)
cforestImpTrad
## V4 V1 V2 duplicate1 duplicate2 V5
## 6.49466579 6.49080949 5.55767891 4.33895836 2.75129426 2.01489687
## V7 V3 V9 V8 V6 V10
## 0.22587739 0.19886434 0.04520632 0.03562828 0.03543652 -0.30228330
cforestImpStrobl
## V4 V2 V1 duplicate1 V5
## 5.7081811063 4.7418228149 3.3435137816 1.8499787494 1.4786681242
## duplicate2 V3 V6 V7 V8
## 0.7587727058 0.0736672033 0.0557837900 0.0003579684 -0.1045100995
## V9 V10
## -0.1902618378 -0.4713946121
Comparing the traditional cforest importance measure with the random forest rankings above, V1 and V2 swap positions, as do duplicate2 and V5. Note also that under the traditional measure one of the uninformative predictors, V7, sits just above the informative predictor V3, whereas the conditional (Strobl) measure keeps V3 above all of the uninformative predictors and shrinks the importance of the correlated duplicates relative to V1. Overall, both measures show the same broad pattern as the traditional random forest model: the informative predictors and the duplicates at the top, with V6 through V10 near zero.
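A side-by-side view of the two cforest measures (built from the vectors computed above) makes the comparison easier to read:
# line up the conditional importances against the traditional ordering
round(data.frame(
  traditional = cforestImpTrad,
  conditional = cforestImpStrobl[names(cforestImpTrad)]
), 3)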
The approach for extracting and plotting importance from the GBM model follows a Stack Overflow answer.
gbmModel <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
summary.gbm(gbmModel)
## var rel.inf
## V4 V4 30.6526712
## V2 V2 24.3167512
## V1 V1 21.5777329
## V5 V5 10.6053022
## V3 V3 7.5204037
## duplicate1 duplicate1 2.9138287
## duplicate2 duplicate2 1.8524757
## V7 V7 0.1596372
## V6 V6 0.1444553
## V9 V9 0.1408557
## V10 V10 0.1158861
## V8 V8 0.0000000
The output above shows that V4, V2, and V1 are the three most important predictors, while V6 through V10 have little to no relative influence. Also note that although duplicate1 and duplicate2 pick up some importance because of their high correlation with V1, they sit at the low end of the informative group, just above the uninformative predictors.
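As an optional what-if (not part of the original exercise), refitting the GBM without the two duplicate columns shows how the relative influence attributed to them flows back to V1:
# drop the correlated copies and refit with the same defaults as gbmModel
gbmNoDup <- gbm(y ~ . - duplicate1 - duplicate2, data = simulated,
                distribution = "gaussian")
summary.gbm(gbmNoDup, plotit = FALSE)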
cubistTuned <- train(y ~ ., data = simulated, method = "cubist")
varImp(cubistTuned)
## cubist variable importance
##
## Overall
## V1 100.00
## V2 79.17
## V4 68.06
## V3 58.33
## V5 52.08
## V6 25.69
## duplicate2 0.00
## V9 0.00
## V8 0.00
## V10 0.00
## duplicate1 0.00
## V7 0.00
Interestingly, the output above shows that V1 through V6 are the only predictors with nonzero importance; the remaining predictors contribute nothing to the Cubist model. V6 is an uninformative predictor yet still picks up some importance, while duplicate1 and duplicate2 receive none, presumably because they convey almost the same information as V1 and the rule-based model simply uses V1 instead.
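One way to confirm this (optional) is to inspect the rules and linear models inside the final Cubist fit; the attribute-usage table at the end of the summary shows which predictors appear in rule conditions versus model terms:
summary(cubistTuned$finalModel)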
Use a simulation to show tree bias with different granularities.
Page 182 in the Applied Predictive Modeling textbook says the following about regression trees:
“Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010).”
We should create a dataset which contains variables with increasing numbers of distinct values in order to illustrate this characteristic of regression trees.
(Tutorial used to create vector of random samples)
set.seed(100)
x1 <- sample(0:1, size = 200, replace = TRUE)
x2 <- sample(0:1000, size = 200, replace = FALSE)
x3 <- sample(0:100000, size = 200, replace = FALSE)
y <- sample(0:1000, size = 200, replace = TRUE)
sample_df <- data.frame(x1, x2, x3, y)
Now that we have our data, let's fit a single-tree model to the dataset and check the variable importance using the varImp function.
set.seed(100)
rpartTune <- train(y ~ ., data = sample_df,
method = "rpart2",
trControl = trainControl(method = "cv"),
tuneLength = 10)
## note: only 8 possible values of the max tree depth from the initial fit.
## Truncating the grid to 8 .
varImp(rpartTune)
## rpart2 variable importance
##
## Overall
## x3 100.00
## x2 83.85
## x1 0.00
The output above shows that x3, which takes 200 distinct values drawn from 0 to 100,000 without replacement, has the highest importance, followed by x2 (drawn from 0 to 1,000 without replacement), with x1 last. Since x1 can only be 0 or 1, it ends up with an importance of 0. Because the response y is pure noise, none of the predictors is truly informative, so this ordering reflects the tree's selection bias toward predictors with more distinct values.
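To show the bias more directly, a minimal sketch (assuming the rpart package) repeats the simulation many times and records which variable is chosen for the very first split. Since y is pure noise, an unbiased procedure would pick x1, x2, and x3 about equally often; a strong tilt toward x3 illustrates the selection bias.
library(rpart)

set.seed(100)
first_split <- replicate(200, {
  dat <- data.frame(
    x1 = sample(0:1, size = 200, replace = TRUE),
    x2 = sample(0:1000, size = 200, replace = FALSE),
    x3 = sample(0:100000, size = 200, replace = FALSE),
    y  = sample(0:1000, size = 200, replace = TRUE)
  )
  # a single-split tree; cp = 0 so the best available split is always kept
  stump <- rpart(y ~ ., data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  as.character(stump$frame$var[1])
})

table(first_split)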
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Which model do you think would be more predictive of other samples?
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
The model on the right focuses its importance on just the first few predictors because the bagging fraction is high: each tree is built on a large fraction of the training data, so the same strong predictors tend to be found tree after tree, and their contribution to the importance metric keeps accumulating. The model on the left spreads its importance across more predictors because the bagging fraction is 0.1 instead of 0.9; each tree sees a small, different sample of the data, which reduces the chance that the same predictors dominate every tree generated in the iterative fashion that stochastic gradient boosting is known for.
The learning rate scales how much each new tree's predictions are added to the running model. With a low learning rate each tree contributes only a little, so many more iterations are needed and the less important predictors get more opportunities to affect the model; with a high learning rate, a few early trees built on the strongest predictors do most of the work. This also helps explain why the importance profile for the model on the left is more spread out than the one on the right, as the sketch below illustrates.
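A minimal sketch of the Fig. 8.24 comparison, assuming the solubility data objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package and leaving the other gbm settings at illustrative defaults, so the numbers will not match the book exactly:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solData <- cbind(solTrainXtrans, y = solTrainY)

set.seed(100)
gbmLow  <- gbm(y ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(y ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: spread out for the 0.1/0.1 fit, concentrated for 0.9/0.9
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)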
Generally speaking, we would prefer a model that is less prone to overfitting. I feel that the model on the right is fit too closely to the training data, while the model on the left, with its more spread-out importance profile, would be less prone to overfitting and could well achieve a lower RMSE if both models were evaluated on a test set.
Page 205 in the Applied Predictive Modeling textbook explains the following about interaction depth:
“When regression trees are used as the base learner, simple gradient boosting for regression has two tuning parameters: tree depth and number of iterations. Tree depth in this context is also known as interaction depth, since each subsequent split can be thought of as a higher-level interaction term with all of the other previous split predictors.”
Therefore, if we were to increase the interaction depth, each tree would have more splits. More splits give the less important predictors more chances to appear across iterations, so importance would be spread over more predictors and the slopes of the importance profiles in Fig. 8.24 would become less steep as interaction depth increases; a sketch of this follows.
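Continuing the same sketch (same solubility setup and assumptions as above), varying only interaction.depth, with 1 versus 10 chosen purely for illustration, shows the relative-influence profile flattening as depth grows:
set.seed(100)
gbmDepth1  <- gbm(y ~ ., data = solData, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 1)
set.seed(100)
gbmDepth10 <- gbm(y ~ ., data = solData, distribution = "gaussian",
                  n.trees = 100, interaction.depth = 10)

# deeper trees give minor predictors more chances to enter, so the drop-off
# in rel.inf from the top predictors is less steep
head(summary(gbmDepth1,  plotit = FALSE), 10)
head(summary(gbmDepth10, plotit = FALSE), 10)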
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Question 6.3 parts a, b, and c were performed in a previous homework and will be replicated here for the data imputation, data splitting, and pre-processing steps.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
impute <- ChemicalManufacturingProcess %>%
preProcess("knnImpute")
imputed <- predict(impute, ChemicalManufacturingProcess)
X <- imputed %>% select(-Yield)
y <- imputed$Yield
high_freq_predictors <- X[,-nearZeroVar(X)]
dim(high_freq_predictors)
## [1] 176 56
We create the training and test set in the code chunk below.
set.seed(1)
sample <- sample.split(y, SplitRatio = 0.8)
train_X <- subset(high_freq_predictors, sample == TRUE) %>% as.matrix()
test_X <- subset(high_freq_predictors, sample == FALSE) %>% as.matrix()
train_y <- subset(y, sample == TRUE) %>% as.vector()
test_y <- subset(y, sample == FALSE) %>% as.vector()
chemdata <- cbind(train_X, train_y)
chemdata <- as.data.frame(chemdata)
colnames(chemdata)[ncol(chemdata)] <- "y"
set.seed(100)
rpartTune <- train(train_X, train_y,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
rfModel <- randomForest(y ~ .,
chemdata,
importance = TRUE,
ntree = 1000)
gbmModel <- gbm(y ~ ., data = chemdata, distribution = 'gaussian')
cubistTuned <- train(y ~ ., data = chemdata, method = "cubist")
rpartPred <- predict(rpartTune, newdata = test_X)
postResample(pred = rpartPred, obs = test_y)
## RMSE Rsquared MAE
## 0.7830512 0.4084461 0.6230981
rfPred <- predict(rfModel, newdata = test_X)
postResample(pred = rfPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5913957 0.5515540 0.4206182
gbmPred <- predict(gbmModel, newdata = data.frame(test_X))
## Using 100 trees...
postResample(pred = gbmPred, obs = test_y)
## RMSE Rsquared MAE
## 0.6612517 0.4709783 0.5090027
cubistPred <- predict(cubistTuned, newdata = data.frame(test_X))
postResample(pred = cubistPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5868265 0.5714771 0.4384501
The output above shows that the Cubist model gives the best test set performance, with the lowest RMSE (0.587) and the highest R-squared (0.571); it explains roughly 57% of the variance in the test-set yield, compared with an R-squared of about 0.47 for the boosted trees model. A sketch for comparing resampling performance on equal footing follows.
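Since part (a) also asks about resampling performance, here is a sketch of how the cross-validated metrics could be compared directly, assuming all four models are refit through caret on shared CV folds (the method names "rf" and "gbm" are caret's built-in wrappers):
set.seed(100)
cvFolds <- createFolds(train_y, k = 10, returnTrain = TRUE)
ctrlCV  <- trainControl(method = "cv", index = cvFolds)

rpartCV  <- train(train_X, train_y, method = "rpart2", tuneLength = 10, trControl = ctrlCV)
rfCV     <- train(train_X, train_y, method = "rf",     trControl = ctrlCV)
gbmCV    <- train(train_X, train_y, method = "gbm",    trControl = ctrlCV, verbose = FALSE)
cubistCV <- train(train_X, train_y, method = "cubist", trControl = ctrlCV)

# resampled RMSE / R-squared computed on identical folds for a fair comparison
summary(resamples(list(rpart = rpartCV, rf = rfCV, gbm = gbmCV, cubist = cubistCV)))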
varImp(cubistTuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess31 45.946
## ManufacturingProcess09 43.243
## ManufacturingProcess13 42.342
## ManufacturingProcess04 37.838
## ManufacturingProcess33 36.937
## ManufacturingProcess17 30.631
## BiologicalMaterial03 27.928
## BiologicalMaterial02 26.126
## ManufacturingProcess37 25.225
## ManufacturingProcess28 19.820
## ManufacturingProcess29 18.919
## ManufacturingProcess25 18.919
## ManufacturingProcess27 13.514
## BiologicalMaterial09 13.514
## BiologicalMaterial06 11.712
## ManufacturingProcess30 9.910
## BiologicalMaterial08 9.910
## BiologicalMaterial11 9.910
## BiologicalMaterial12 9.009
The output above lists the 20 predictors with the highest overall importance; let's focus on the top 10. ManufacturingProcess32 has an importance of 100, followed by a sharp drop to 45.95 for ManufacturingProcess31. Of the top 10, only 2 are biological material predictors, so the manufacturing process variables dominate the list.
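A quick tally (a convenience check built on the varImp output above) confirms the split between predictor types in the Cubist top 10:
cubistImp <- varImp(cubistTuned)$importance
top10     <- rownames(cubistImp)[order(cubistImp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))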
For comparison's sake, I have provided the SVM model from Chapter 7 and the PLS model from Chapter 6 below.
svmRTuned <- train(train_X, train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 125, 128, 127, 125, 128, 125, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7893858 0.5054155 0.6470156
## 0.50 0.7240074 0.5453873 0.5959249
## 1.00 0.6622741 0.5886838 0.5427651
## 2.00 0.6263311 0.6334186 0.5091241
## 4.00 0.6139378 0.6515314 0.4969291
## 8.00 0.6245266 0.6412209 0.5079059
## 16.00 0.6237248 0.6421324 0.5080469
## 32.00 0.6237248 0.6421324 0.5080469
## 64.00 0.6237248 0.6421324 0.5080469
## 128.00 0.6237248 0.6421324 0.5080469
## 256.00 0.6237248 0.6421324 0.5080469
## 512.00 0.6237248 0.6421324 0.5080469
## 1024.00 0.6237248 0.6421324 0.5080469
## 2048.00 0.6237248 0.6421324 0.5080469
##
## Tuning parameter 'sigma' was held constant at a value of 0.01440088
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01440088 and C = 4.
svmRPred <- predict(svmRTuned, newdata = test_X)
postResample(pred = svmRPred, obs = test_y)
## RMSE Rsquared MAE
## 0.5987115 0.5605562 0.4612658
varImp(svmRTuned)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 98.03
## BiologicalMaterial06 86.76
## ManufacturingProcess09 84.29
## BiologicalMaterial03 79.41
## ManufacturingProcess17 79.32
## ManufacturingProcess31 76.95
## BiologicalMaterial02 69.88
## BiologicalMaterial12 69.71
## ManufacturingProcess36 67.83
## ManufacturingProcess06 57.40
## ManufacturingProcess11 55.01
## BiologicalMaterial04 54.73
## BiologicalMaterial11 45.83
## ManufacturingProcess29 45.61
## BiologicalMaterial08 44.90
## BiologicalMaterial01 44.41
## ManufacturingProcess33 43.41
## ManufacturingProcess30 41.97
## BiologicalMaterial09 40.99
ctrl <- trainControl(method = "cv", number = 10)
plsTune <- caret::train(
train_X,
train_y,
method = "pls",
tuneLength = 20,
trControl = ctrl,
preProc = c("center", "scale")
)
plsTune
## Partial Least Squares
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 126, 125, 126, 126, 127, 128, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8479010 0.4265157 0.6657416
## 2 1.0529488 0.5038503 0.7214306
## 3 0.8474292 0.5771544 0.5984535
## 4 0.9456246 0.5676970 0.6447407
## 5 1.0659868 0.5552396 0.6719471
## 6 1.1568744 0.5409098 0.6957232
## 7 1.2351712 0.5339243 0.7324440
## 8 1.3269965 0.5165428 0.7683998
## 9 1.4055688 0.5197192 0.7652225
## 10 1.5269544 0.5119143 0.7816197
## 11 1.5896791 0.5089670 0.8410855
## 12 1.6910618 0.4993789 0.8852889
## 13 1.8672296 0.4920632 0.9566055
## 14 2.0574667 0.4932648 1.0114094
## 15 2.0859090 0.5021506 1.0113019
## 16 2.1538067 0.5048579 1.0344105
## 17 2.1909744 0.5084537 1.0469477
## 18 2.2070858 0.5063554 1.0582202
## 19 2.2386613 0.5105727 1.0724065
## 20 2.2673578 0.5123984 1.0827180
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
plsPred <- predict(plsTune, newdata = test_X)
postResample(pred = plsPred, obs = test_y)
## RMSE Rsquared MAE
## 0.6625784 0.5450552 0.5616085
varImp(plsTune)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 84.32
## ManufacturingProcess13 82.00
## ManufacturingProcess17 78.26
## ManufacturingProcess36 73.23
## ManufacturingProcess11 57.16
## BiologicalMaterial02 56.69
## BiologicalMaterial08 55.83
## BiologicalMaterial06 54.80
## BiologicalMaterial12 53.51
## ManufacturingProcess33 53.10
## ManufacturingProcess06 52.95
## BiologicalMaterial11 52.67
## BiologicalMaterial03 52.19
## BiologicalMaterial01 47.55
## BiologicalMaterial04 46.40
## ManufacturingProcess34 45.04
## ManufacturingProcess12 44.92
## ManufacturingProcess28 43.77
## ManufacturingProcess04 41.58
Through the postResample function, we can compare test set performance across the models presented here: the Cubist model (RMSE 0.587, R-squared 0.571) edges out the SVM model from Chapter 7 (RMSE 0.599, R-squared 0.561), and both outperform the PLS model from Chapter 6 (RMSE 0.663, R-squared 0.545). The top 10 predictors tell a similar story across all three models: ManufacturingProcess32 is ranked first by each of them, and manufacturing process variables dominate, accounting for 8 of the Cubist top 10 and 6 of each of the SVM and PLS top 10.
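For reference, a small helper (written here for convenience; top10_of is not a caret function) lines up the top 10 predictors from each of the three models side by side:
top10_of <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}

data.frame(cubist = top10_of(cubistTuned),
           svm    = top10_of(svmRTuned),
           pls    = top10_of(plsTune))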
ggplot(varImp(cubistTuned), top = 10)
ggplot(varImp(svmRTuned), top = 10)
The predictor-importance profile for the SVM model falls off much less steeply than that of the Cubist model.
rpartTree <- as.party(rpartTune$finalModel)
plot(rpartTree)
The plot above shows that the yield is, on average, higher when ManufacturingProcess32 is greater than or equal to 0.192 and below average when it is less than 0.192 (recall that the response was centered and scaled during preprocessing, so negative values correspond to below-average yield). It also shows that the only two biological material predictors appearing in the tree come into play only when ManufacturingProcess32 is less than 0.192.