8.1

Recreate the simulated data from Exercise 7.2:

# From textbook

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated) 
colnames(simulated)[ncol(simulated)] <- "y" 

a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~., data = simulated,  
                         importance = TRUE, 
                         ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE) 

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

No. The uninformative predictors (V6-V10) do not appear to have contributed meaningfully to the model: the smallest importance score among the informative predictors was 0.76 (V3), while none of the uninformative predictors scored above 0.17.
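As a quick check on this claim, the two numbers can be pulled directly from rfImp1 (a minimal sketch, assuming the rfImp1 object computed above):

# Smallest importance among the informative predictors vs. the largest
# among the uninformative ones (sketch; uses rfImp1 from the chunk above)
min(rfImp1[paste0("V", 1:5), "Overall"])
max(rfImp1[paste0("V", 6:10), "Overall"])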

b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1  
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

The importance score for V1 decreased in model 2, from about 8.73 to 5.69, because the highly correlated duplicate1 now receives part of the credit for splits that previously went to V1.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

Adding a second predictor highly correlated with V1 decreased its importance further in model 3 (to about 4.92), with the credit now split across V1, duplicate1, and duplicate2.
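This dilution is one reason correlated predictors are often flagged before fitting. A minimal sketch using caret's findCorrelation() on the simulated data (the 0.9 cutoff is an arbitrary choice for illustration):

# Flag predictors with pairwise correlations above 0.9 (sketch)
predictor_cols <- setdiff(names(simulated), "y")
findCorrelation(cor(simulated[, predictor_cols]), cutoff = 0.9, names = TRUE)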

c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
ctrl <- cforest_control(mtry = ncol(simulated) - 1)
cforest <- cforest(y ~ ., data = simulated, controls = ctrl)

cfImp <- varimp(cforest)
cfImp
##           V1           V2           V3           V4           V5           V6 
##  3.885759655  7.668577261  0.022784363 10.187144892  1.944941733  0.001131697 
##           V7           V8           V9          V10   duplicate1   duplicate2 
##  0.046943026 -0.049653633  0.018987655 -0.013785512  5.639384800  0.147354677

Yes, the overall pattern matches the traditional random forest model: the informative predictors (apart from V3, which scores low under both approaches) clearly outrank the uninformative V6-V10, and the highly correlated duplicate variables again score higher than the uninformative ones.
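The exercise also mentions the conditional importance measure of Strobl et al. (2007); it can be requested through the conditional argument (a sketch using the cforest object fit above; this computation is noticeably slower):

# Conditional permutation importance (Strobl et al. 2007)
cfImp_cond <- varimp(cforest, conditional = TRUE)
sort(cfImp_cond, decreasing = TRUE)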

d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

btree1 <- train(x = simulated[,-11], y = simulated$y, method = "bstTree")

btree1_vi <- varImp(btree1, scale = FALSE)
print(btree1_vi)
## loess r-squared variable importance
## 
##             Overall
## V4         0.289801
## duplicate1 0.277400
## V1         0.277043
## V2         0.260337
## duplicate2 0.238930
## V5         0.134314
## V3         0.090934
## V9         0.023943
## V10        0.021639
## V8         0.015209
## V6         0.008499
## V7         0.005977
cubist1 <- train(x = simulated[,-11], y = simulated$y, method = "cubist",
                  tuneLength = 10)

cubist1_vi <- varImp(cubist1, scale = FALSE)
print(cubist1_vi)
## cubist variable importance
## 
##            Overall
## V2            69.5
## V1            54.0
## V4            50.0
## V5            38.0
## V3            32.0
## duplicate2    25.0
## duplicate1    25.0
## V6            10.0
## V9             0.0
## V8             0.0
## V7             0.0
## V10            0.0

The variable importance pattern in the Cubist model is somewhat different. The five informative predictors are again ranked highest, as in the previous models, but several of the uninformative predictors (V7-V10) receive no importance at all, and the two duplicates are credited far less than V1 itself.
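Note that caret has no model-specific importance method for "bstTree", so varImp() above falls back to a filter-based measure (hence the "loess r-squared variable importance" header). As a cross-check with a boosting implementation that reports its own relative influence, a sketch using method = "gbm" (the tuning settings are arbitrary):

# Boosted trees via gbm, whose varImp() reports model-based relative influence
gbm1 <- train(x = simulated[, -11], y = simulated$y, method = "gbm",
              trControl = trainControl(method = "cv", number = 5),
              tuneLength = 5, verbose = FALSE)
varImp(gbm1, scale = FALSE)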

8.2

Use a simulation to show tree bias with different granularities.

library(rpart)
library(partykit)
library(rpart.plot)
x1 <- sample(1:10 /10, 150, replace = TRUE)
x2 <- sample(1:100 /100, 150, replace = TRUE)
x3 <- sample(1:1000 /1000, 150, replace = TRUE)
x4 <- sample(1:10000 /10000, 150, replace = TRUE)

y1 <- x1 + x2 + x3 + rnorm(150, mean=0, sd=5)

simData <- data.frame(x1, x2, x3, x4, y1) 

tree_model_cont <- rpart(y1 ~ ., data = simData, method = "anova")

rpart.plot(tree_model_cont, type = 2, extra = 100, fallen.leaves = TRUE)

Although x1, x2, and x3 contribute equally to y, the first split is made on x2 rather than the coarser x1. Recursive partitioning favors predictors with more distinct values (finer granularity) because they offer more candidate split points, so the tree's splits are biased toward the finer-grained predictors.
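The bias is easier to see when the simulation is repeated. A minimal sketch that refits the tree many times and records which predictor is used at the root split (the number of repetitions is an arbitrary choice):

# Repeat the simulation and tally the root-split variable of each tree
set.seed(123)
first_split <- replicate(200, {
  x1 <- sample(1:10 / 10, 150, replace = TRUE)
  x2 <- sample(1:100 / 100, 150, replace = TRUE)
  x3 <- sample(1:1000 / 1000, 150, replace = TRUE)
  x4 <- sample(1:10000 / 10000, 150, replace = TRUE)
  y1 <- x1 + x2 + x3 + rnorm(150, mean = 0, sd = 5)
  fit <- rpart(y1 ~ x1 + x2 + x3 + x4)
  as.character(fit$frame$var[1])  # variable at the root node ("<leaf>" if no split)
})
table(first_split)  # finer-grained predictors should be chosen more often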

8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

With both parameters set to 0.9, each tree is built on most of the training set (high bagging fraction) and each tree's contribution is added with a large weight (high learning rate). The first few trees therefore explain most of the structure, so the predictors they select dominate the importance ranking.

With both parameters set to 0.1, each tree sees only a small random fraction of the training set and contributes only a small step toward the fit. Many more trees share the work, so many more predictors are used and the importance profile is spread more evenly.
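A sketch of how the comparison in Fig. 8.24 could be reproduced on the solubility data, fitting gbm directly with the two extreme settings (n.trees and interaction.depth here are arbitrary choices, not the values used in the book):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Both tuning parameters at 0.1
gbm_low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 1,
               shrinkage = 0.1, bag.fraction = 0.1)
# Both tuning parameters at 0.9
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)

head(summary(gbm_low, plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)  # importance concentrated on a few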

b

Which model do you think would be more predictive of other samples?

I think the model on the left would be more predictive of other samples. With both the bagging fraction and the learning rate at 0.9, the model on the right leans heavily on its first few trees and is at greater risk of overfitting the training data.

c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows deeper trees with more splits per iteration, so more predictors are used and credited with importance. This would make the variable importance profile less steep for either model, i.e., the slope of predictor importance would decrease.
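This can be checked by refitting with deeper trees (a sketch, assuming solTrain and the gbm settings from the sketch in part (a)):

# Same extreme settings as gbm_high above, but with much deeper trees
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 10,
                shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_deep, plotit = FALSE), 10)  # importance spread over more predictors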

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

# From textbook 
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Impute missing values via knn
library(impute)
chemicals <- impute.knn(as.matrix(ChemicalManufacturingProcess), rng.seed = 123)
chemicals <- as.data.frame(chemicals$data)

# Test/train split
training_chemicals <- createDataPartition(chemicals$Yield, p = 0.8, list = FALSE)
train_predictors_chemicals <- chemicals[training_chemicals,-1]
train_yield <- chemicals[training_chemicals, 1]
test_predictors_chemicals <- chemicals[-training_chemicals,-1]
test_yield <- chemicals[-training_chemicals, 1]

# Random forest model 
chemicals_rf <- train(x = train_predictors_chemicals,
                                 y = train_yield,
                                 method = "rf",
                                 preProc = c("center", "scale"),
                                 tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## (warning repeated for each resampling iteration)
# Boosted tree 
chemicals_boost <- train(x = train_predictors_chemicals,
                         y = train_yield,
                         method = "gbm",
                         trControl = trainControl(method = "cv", number = 5),
                         tuneLength = 10,
                         verbose = FALSE)
# cubist model
chem_cubist <- train(x = train_predictors_chemicals,
                          y = train_yield,
                          method = "cubist",
                          preProc = c("center", "scale"),
                          tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## (warning repeated for each resampling iteration)

a

Which tree-based regression model gives the optimal resampling and test set performance?
rf_predict <- predict(chemicals_rf, test_predictors_chemicals)
yield_rf <- data.frame(obs = test_yield, pred = rf_predict)
print(defaultSummary(yield_rf))
##      RMSE  Rsquared       MAE 
## 1.2500682 0.5569329 0.9597679
boost_predict <- predict(chemicals_boost, test_predictors_chemicals)
yield_boosted <- data.frame(obs = test_yield, pred = boost_predict)
print(defaultSummary(yield_boosted))
##      RMSE  Rsquared       MAE 
## 1.1745569 0.5929193 0.9122479
cubist_predict <- predict(chem_cubist, test_predictors_chemicals)
yield_cubist <- data.frame(obs = test_yield, pred = cubist_predict)
print(defaultSummary(yield_cubist))
##      RMSE  Rsquared       MAE 
## 0.9977879 0.7017476 0.8410910

The Cubist model performed best, with the highest R-squared and the lowest RMSE on the test set.
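The resampling side of the question can be checked directly from the fitted objects (a sketch using caret's getTrainPerf(), which reports the resampled performance at each model's chosen tuning parameters):

# Resampled performance of the selected tuning parameters for each model
rbind(RF     = getTrainPerf(chemicals_rf),
      GBM    = getTrainPerf(chemicals_boost),
      Cubist = getTrainPerf(chem_cubist))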

b

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(chem_cubist, scale = FALSE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32    53.5
## ManufacturingProcess17    33.5
## ManufacturingProcess09    20.5
## ManufacturingProcess33    18.5
## BiologicalMaterial02      18.0
## BiologicalMaterial06      18.0
## ManufacturingProcess04    13.5
## BiologicalMaterial05      12.0
## ManufacturingProcess28    11.5
## ManufacturingProcess13    11.0
## ManufacturingProcess45    10.5
## BiologicalMaterial08      10.5
## BiologicalMaterial11      10.0
## ManufacturingProcess15     7.5
## ManufacturingProcess11     6.0
## ManufacturingProcess37     6.0
## ManufacturingProcess27     6.0
## BiologicalMaterial03       6.0
## BiologicalMaterial12       6.0
## BiologicalMaterial01       5.0

Manufacturing process variables dominate the top 10 list (7 of 10). As noted in previous assignments, however, there are more manufacturing process predictors than biological predictors overall, so they do not necessarily dominate on a proportional basis.

The top manufacturing processes were 32, 17, 09, 33, 04, 28, and 13; the top biological materials were 02, 06, and 05.

In the linear model from homework 7, the top manufacturing processes were 32, 17, 13, 09, 36, 06, and 33, and the top biological materials were 06, 02, and 03.

In the SVM model from homework 8, the top manufacturing processes were 32, 36, 13, 17, 09, and 31, and the top biological materials were 06, 03, 02, and 12.

The variables that appear in all three models are ManufacturingProcess 32, 17, 09, and 13 and BiologicalMaterial 02 and 06. In all three models, ManufacturingProcess32 was the most important variable.
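For a quick visual version of this comparison, the top 10 Cubist predictors can be plotted (a sketch using caret's plot method for varImp objects):

# Top 10 predictors in the Cubist model
plot(varImp(chem_cubist, scale = FALSE), top = 10)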

c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
single_tree <- train(x = train_predictors_chemicals,
                     y = train_yield,
                     method = "rpart",
                     preProc = c("center", "scale"),
                     tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
rpart.plot(single_tree$finalModel)

The first split is on ManufacturingProcess32, which is consistent with it being the highest-ranked variable in the other models. It is interesting that BiologicalMaterial11 appears as a split variable, since it was not a top-10 predictor in any of the models tested.
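The exercise asks for the distribution of yield in the terminal nodes, and rpart.plot() shows only the node means. A sketch of one way to get the full distributions, converting the rpart fit to a partykit object (note the split values are on the centered and scaled predictors because of the preProc step):

# Terminal-node boxplots of yield for the single tree
plot(as.party(single_tree$finalModel))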