library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.2.3
library(ipred)
## Warning: package 'ipred' was built under R version 4.2.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.2.3
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Warning: package 'caret' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.2.3
model1 <- randomForest(y ~ ., data = simulated,importance = TRUE,ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

The model places more importance on 1,2, 4 and somewhat 5 (8.7, 6.4, 7.6, and 2.0, respectively) while 6-10 appear to have little (0.16 and below).

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

V1’s importance decreased from 8.7 to 5.6 when you add another predcitor that is highly correlated with it.

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)
## Warning: package 'party' was built under R version 4.2.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.2.3
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.2.3
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
model3 <- cforest(y ~ ., data = simulated)
rfImp3 <- varimp(model3, conditional = TRUE) %>% 
  as.data.frame()
rfImp3
##                        .
## V1          1.8986239733
## V2          4.8021626697
## V3          0.0229993405
## V4          6.0471706526
## V5          0.9850544288
## V6         -0.0119652434
## V7         -0.0104327775
## V8         -0.0104862941
## V9          0.0004516316
## V10        -0.0074652543
## duplicate1  1.9703660210

Using the cforest function yields the same imporance patterns as the traditional random forest model, where V1, V2, V4 and 5 show more importance than the others.

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(Cubist)
## Warning: package 'Cubist' was built under R version 4.2.3
model4 <- cubist(simulated[, colnames(simulated)[colnames(simulated) != 'y']], 
                 simulated$y)
rfImp4 <- varImp(model4, scale = FALSE)
rfImp4
##            Overall
## V1              50
## V2              50
## V4              50
## V5              50
## duplicate1      50
## V3               0
## V6               0
## V7               0
## V8               0
## V9               0
## V10              0

Yes, the same pattern occurs.

8.2 Use a simulation to show tree bias with different granularities.

set.seed(1234)

#simulate 100 samples, predictor x, nonlinear relationship
n <- 100
x <- runif(n, -5, 5)
y <- ifelse(x^2 + rnorm(n) > 1, 1, 0)

df <- data.frame(x = x, y = as.factor(y))

ctree_granular <- ctree(y ~ x, df, controls = ctree_control(mincriterion = 0.1))

plot(ctree_granular)

8.3 8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has a higher learning ratio and as the ratio increases towards 1, models will have less predictors needed for the response variable. It also has a higher bagging fraction (0.9 per the figure in the text), meaning less predictors will be important.

(b) Which model do you think would be more predictive of other samples?

In terms of the model on the right as the bagging fraction and learning ratio increase, the model’s performance will be inverse. Therefore, with more predictors being identified with more of a distributed importance, I think the model on the left would be more predictive of other samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

As interaction depth increases, the importance of predictors should spread over more predictors and the lines in the figure would spread further right.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

set.seed(789)

data(ChemicalManufacturingProcess)

# imputation
imp <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(imp, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(135)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]

(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(246)
rfModel <- randomForest(train_x, train_y, importance = TRUE, ntree = 1000)
rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.9327773 0.8182165 0.7298731

Cubist

set.seed(246)
cubist_train <- train(train_x, train_y, method = "cubist")
cubistPred <- predict(cubist_train, test_x)

postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.9571867 0.7379501 0.7042463

Single Tree

set.seed(246)
cart_train <- train(train_x, train_y,method = "rpart",tuneLength = 10,trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cart_train, test_x)

postResample(cartPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.3913659 0.4252951 1.0420744

Bagged Trees

set.seed(246)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.0442505 0.7046695 0.7522740

Cubist has the lower RMSE of 0.9316272 giving it the best test set performance.

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

varImp(cubist_train, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  67.327
## ManufacturingProcess09  62.376
## ManufacturingProcess04  28.713
## ManufacturingProcess29  26.733
## BiologicalMaterial02    24.752
## BiologicalMaterial03    19.802
## ManufacturingProcess39  17.822
## ManufacturingProcess24  17.822
## ManufacturingProcess13  17.822
## ManufacturingProcess33  16.832
## ManufacturingProcess01  16.832
## BiologicalMaterial06    16.832
## ManufacturingProcess14  14.851
## ManufacturingProcess26  12.871
## ManufacturingProcess37  11.881
## ManufacturingProcess07  11.881
## BiologicalMaterial12    10.891
## ManufacturingProcess10   9.901
## ManufacturingProcess21   8.911

All of the models seem to show manufacturing processess atop, though in this model, biological materials 6,3,2 do show significance compared to the optimal linear and nonlinear models. Manufacturing 32 shows dominance as it did in the others, meaning it is seemingly important in the process.

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
## Warning: package 'rpart' was built under R version 4.2.3
plot(varImp(cubist_train), top = 10) 

tree <- rpart(Yield ~ ., data = Chemical)
plot(tree)