Recreate the simulated data from Exercise 7.2:
library(rpart)
library(randomForest)
library(caret)
library(tidyverse)
library(gbm)
library(mlbench)
library(Cubist)
set.seed(200)
simulated <- mlbench.friedman1(200,sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# X <- simulated %>% select(!y)
# Y <- simulated %>% select(y)
Fit a random forest model to all of the predictors, then estimate the variable importance scores. Did the random forest model significantly use the uninformative predictors (V6 – V10)?
V6-V10 are the least informative predictors and appear toward the bottom of the importance ranking. Variables V1, V4, and V2 are the most important and occupy the top splits of the trees.
# tune a random forest model
set.seed(1984)
model1 <- train(y~., data=simulated,
method="rf",
tuneLength = 10,
trControl = trainControl(method="cv"))
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
# calculate variable importance
rfImp1 <- varImp(model1, scale=F)
plot(rfImp1, main="Random Forest Model 1")
Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
A new variable that is highly correlated with V1 is added, and its effect on the random forest model is evaluated. As the importance calculations show, each time a correlated copy of V1 is added, V1's importance decreases because the splits (and the importance credit) are shared between V1 and its duplicates.
# create highly correlated variable
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9487342
# run model 2 and get var importance
model2 <- train(y~., data=simulated,
method="rf",
tuneLength = 10,
trControl = trainControl(method="cv"))
rfImp2 <- varImp(model2, scale=F)
rfImp2 %>% plot
# repeat
# create highly correlated variable
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.05
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9843647
# run model 3 and get var importance
model3 <- train(y~., data=simulated,
method="rf",
tuneLength = 10,
trControl = trainControl(method="cv"))
rfImp3 <- varImp(model3, scale=F)
rfImp3 %>% plot()
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
The same procedure used in part (b) for the random forest is repeated with conditional inference trees. The same pattern emerges (although to a lesser extent): the importance of V1 drops as correlated variables are added.
library(party)
simulated <- simulated %>% select(-c(duplicate1,duplicate2))
# tune a random forest model with conditional inference
set.seed(1984)
cfmodel1 <- train(y~., data=simulated,
method="cforest",
tuneLength = 10,
trControl = trainControl(method="cv"))
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
# calculate and plot variable importance
cfImp1 <- varImp(cfmodel1, scale=F)
plot(cfImp1, main="Conditional Inference Forest Model 1")
# create highly correlated variable
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
# run model 2 and get var importance
cfmodel2 <- train(y~., data=simulated,
method="cforest",
tuneLength = 10,
trControl = trainControl(method="cv"))
cfImp2 <- varImp(cfmodel2, scale=F)
cfImp2 %>% plot
# repeat
# create highly correlated variable
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.05
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9835492
# run model 3 and get var importance
cfmodel3 <- train(y~., data=simulated,
method="cforest",
tuneLength = 10,
trControl = trainControl(method="cv"))
cfImp3 <- varImp(cfmodel3, scale=F)
cfImp3 %>% plot()
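The question also points to party's own varimp function, whose conditional argument switches between the traditional permutation importance and the Strobl et al. (2007) conditional measure. As a sketch (not part of the original run; cfDirect is an illustrative name, and ntree is kept small because conditional importance is computationally expensive), the comparison can be made on a direct cforest fit:
# fit cforest directly on the data with both duplicates present
set.seed(1984)
cfDirect <- cforest(y ~ ., data = simulated,
                    controls = cforest_unbiased(ntree = 100))
# traditional (unconditional) permutation importance
varimp(cfDirect, conditional = FALSE)
# conditional importance: permutes each predictor within groups defined by its correlated peers
varimp(cfDirect, conditional = TRUE)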
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
The same process is repeated for the Cubist model. This time V1 remains the most important predictor even when the first correlated variable is added. Once the second correlated variable is added, V1 drops in importance relative to the other variables. Perhaps somewhat trivially, duplicate1 becomes the most important while duplicate2 is barely significant.
simulated <- simulated %>% select(-c(duplicate1,duplicate2))
# tune a Cubist model
set.seed(1984)
cubemodel1 <- train(y~., data=simulated,
method="cubist",
tuneLength = 10,
trControl = trainControl(method="cv"))
# calculate and plot variable importance
cubeImp1 <- varImp(cubemodel1, scale=F)
plot(cubeImp1, main="Cubist Model 1")
# create highly correlated variable
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
# run model 2 and get var importance
cubemodel2 <- train(y~., data=simulated,
method="cubist",
tuneLength = 10,
trControl = trainControl(method="cv"))
cubeImp2 <- varImp(cubemodel2, scale=F)
cubeImp2 %>% plot
# repeat
# create highly correlated variable
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.05
# run model 3 and get var importance
cubemodel3 <- train(y~., data=simulated,
method="cubist",
tuneLength = 10,
trControl = trainControl(method="cv"))
cubeImp3 <- varImp(cubemodel3, scale=F)
cubeImp3 %>% plot()
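The exercise also mentions boosted trees. A minimal sketch of the same check with gbm through caret follows (not part of the original run; gbmBase and gbmDup are illustrative names, and a small tuneLength keeps the fits quick):
# boosted trees without the correlated duplicates
set.seed(1984)
gbmBase <- train(y~., data=simulated %>% select(-c(duplicate1, duplicate2)),
                 method="gbm", tuneLength=5, verbose=FALSE,
                 trControl=trainControl(method="cv"))
# boosted trees with duplicate1 and duplicate2 included
set.seed(1984)
gbmDup <- train(y~., data=simulated,
                method="gbm", tuneLength=5, verbose=FALSE,
                trControl=trainControl(method="cv"))
# compare how V1's importance shifts once the duplicates are available
plot(varImp(gbmBase, scale=FALSE))
plot(varImp(gbmDup, scale=FALSE))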
Use simulation to show tree bias with different granularities
One of the drawbacks of decision trees is selection bias: predictors with a higher number of distinct values are favored over more granular predictors because they offer more candidate split points, so potentially noisy predictors can end up occupying nodes toward the top of the tree. By simulating a dataset with one granular and one non-granular variable we can see how selection bias affects variable importance. The granular predictor (X1) is more correlated with the target, while the non-granular X2 is barely correlated. Despite this, X2 receives higher importance from the decision tree model.
set.seed(666)
X1 <- rep(1:4, each=50)
Y <- X1 + rnorm(200, mean=0, sd=2)
set.seed(624)
X2 <- rnorm(200, mean=mean(Y), sd=4)
simData <- data.frame(Y=Y, X1=X1, X2=X2)
set.seed(624)
cor(simData)
## Y X1 X2
## Y 1.00000000 0.44953061 0.09469228
## X1 0.44953061 1.00000000 0.07241591
## X2 0.09469228 0.07241591 1.00000000
fit <- rpart(Y ~ ., data = simData)
varImp(fit)
## Overall
## X1 0.2582145
## X2 0.3506435
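A complementary check, not in the original write-up: remove the signal entirely and count how often each predictor is chosen at the root. With no information in either predictor, an unbiased procedure would pick each roughly half the time, but the exhaustive split search tends to favor the predictor with more candidate split points.
# sketch: Y is pure noise, unrelated to either predictor
set.seed(42)
rootVar <- replicate(500, {
  X1 <- rep(1:4, each = 50)   # 4 distinct values
  X2 <- rnorm(200)            # ~200 distinct values
  Y <- rnorm(200)             # no signal
  stump <- rpart(Y ~ X1 + X2, data = data.frame(Y, X1, X2),
                 control = rpart.control(maxdepth = 1, cp = 0))
  as.character(stump$frame$var[1])   # variable used at the root (or "<leaf>")
})
table(rootVar)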
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
In stochastic gradient boosting, the regression tree produced at a given iteration depends on the residuals from the previous step (unlike in random forests, where trees are built independently). The bagging fraction and learning rate govern this dependence: the learning rate is the fraction of each new tree's prediction that is added to the running model, and the bagging fraction is the proportion of the training data randomly sampled at each iteration. The effect of these parameters is visible in Figure 8.24. When both are set to the extreme high value (0.9), just a few predictors dominate in terms of importance; when both are reduced to 0.1, the drop-off in predictor importance is much more gradual. Both parameters can and should be tuned when building a stochastic gradient boosting model.
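As a sketch of how the two extremes in Figure 8.24 could be reproduced (this assumes the solubility data shipped with AppliedPredictiveModeling and was not part of the original write-up; gbmLow and gbmHigh are illustrative names and other gbm settings are left at their defaults):
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
# both parameters at 0.1: importance is spread across many predictors
set.seed(100)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# both parameters at 0.9: a handful of predictors dominate
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)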
Which model do you think would be more predictive of other samples?
Although more computationally expensive, the model on the left would probably give better predictions on new samples, since its importance is spread across more predictors and it is less prone to overfitting. A lower learning rate is usually preferred, but the algorithm then requires more iterations, so a balance must be struck.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would allow each tree to use more of the less important predictors, spreading importance across more variables and resulting in a more gradual slope of predictor importance for either model.
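To check this, the high-shrinkage model from the sketch above could be refit with a larger interaction depth and the relative influence compared (again illustrative, not from the original write-up; it reuses solTrain as defined above):
# deeper trees involve more predictors per iteration, which should flatten the importance slope
set.seed(100)
gbmHighDeep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                   interaction.depth = 10)
head(summary(gbmHighDeep, plotit = FALSE), 10)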
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
library(DMwR)
library(dplyr)
data("ChemicalManufacturingProcess")
# impute with KNN from DMwR package
ChemicalManufacturingProcess <- knnImputation(ChemicalManufacturingProcess, k=3)
# convert to dataframe: target is the first variable
chem_df <- ChemicalManufacturingProcess %>%
as.data.frame()
# function removes sparse variables with no predictive power
chem_df <- chem_df[,-nearZeroVar(chem_df)]
# preprocess: center and scale
chem_df <- scale(chem_df,center=T,scale=T) %>%
as.data.frame()
# Train/Test Split
set.seed(2021)
trainIndex <- createDataPartition(chem_df$Yield , p = .8) %>%
unlist()
trainY <- chem_df[ trainIndex, 'Yield']
trainX <- chem_df[trainIndex, -1]
train <- cbind(trainY,trainX)
testY <- chem_df[-trainIndex, 'Yield']
testX <- chem_df[-trainIndex, -1]
A random forest model is tuned using cross-validation. RMSE was used to select the optimal model using the smallest value. The final value used for the model was mtry = 14.
# tune a random forest model
set.seed(1984)
rf <- train(x=trainX, y=trainY,
method="rf",
tuneLength = 10,
trControl = trainControl(method="cv"))
plot(rf)
rf_pred <- predict(rf, testX)
rf_acc <- postResample(pred=rf_pred, obs=testY)
A gradient boosted model is tuned over a grid of shrinkage, interaction depth, and number of trees, using caret's default bootstrap resampling. The tuning parameter n.minobsinnode was held constant at a value of 10. RMSE was used to select the optimal model; the final values were n.trees = 2500, interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 10.
# define tuning grid
grid <- expand.grid(interaction.depth = c(1, 3, 5),
n.trees = seq(50, 2500, by = 50),
shrinkage = c(0.01, 0.001),
n.minobsinnode = 10)
set.seed(1984)
boosted <- train(x=trainX, y=trainY,
method="gbm",
tuneGrid=grid,
verbose=F)
gb_pred <- predict(boosted, testX)
gb_acc<- postResample(pred=gb_pred, obs=testY)
Model 3. Cubist
A Cubist model is tuned using caret's default bootstrap resampling.
set.seed(101)
cu <- train(x=trainX, y=trainY,
method="cubist",
verbose=F)
cu_pred <- predict(cu, testX)
cu_acc <- postResample(pred=cu_pred, obs=testY)
Which tree-based regression model gives the optimal resampling and test set performance?
Random forest gives the best test set performance by RMSE and MAE, while Cubist has a slightly higher R-squared; overall, random forest is selected as the best model. (A resampling comparison is sketched after the table.)
df <- data.frame(cu_acc, rf_acc, gb_acc)
colnames(df) <- c("Cubist", "Random Forest", "Gradient Boosted")
df
## Cubist Random Forest Gradient Boosted
## RMSE 0.6222242 0.5772369 0.6788049
## Rsquared 0.6148573 0.6003454 0.4999687
## MAE 0.4681127 0.4659928 0.5000961
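The question also asks about resampling performance. The three fits above used different seeds and resampling schemes (10-fold CV for the random forest, caret's default bootstrap for the other two), so a direct comparison requires refitting them with a common trainControl. A sketch, with illustrative object names and a smaller tuneLength to keep it quick:
# refit with a shared 10-fold CV control so caret::resamples can compare them
ctrl <- trainControl(method = "cv", number = 10)
set.seed(1984)
rf_cv <- train(x = trainX, y = trainY, method = "rf", tuneLength = 5, trControl = ctrl)
set.seed(1984)
gb_cv <- train(x = trainX, y = trainY, method = "gbm", tuneLength = 5, trControl = ctrl, verbose = FALSE)
set.seed(1984)
cu_cv <- train(x = trainX, y = trainY, method = "cubist", tuneLength = 5, trControl = ctrl)
resamps <- resamples(list("Random Forest" = rf_cv, "Gradient Boosted" = gb_cv, "Cubist" = cu_cv))
summary(resamps)
bwplot(resamps, metric = "RMSE")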
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
As with the linear and nonlinear models, ManufacturingProcess32 is the most important predictor; ManufacturingProcess13 ranks second here. Manufacturing process variables sit at the top of the list, but they do not dominate it as heavily as in some of the other models, and several biological materials appear among the top ten.
rfImp <- varImp(rf, scale=F)
rfImp
## rf variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 21.686
## ManufacturingProcess13 10.359
## BiologicalMaterial12 9.103
## BiologicalMaterial03 7.720
## ManufacturingProcess31 7.611
## ManufacturingProcess17 5.732
## ManufacturingProcess09 5.672
## BiologicalMaterial06 5.477
## ManufacturingProcess06 5.069
## ManufacturingProcess36 4.743
## BiologicalMaterial02 3.849
## BiologicalMaterial11 3.527
## BiologicalMaterial04 3.379
## BiologicalMaterial08 3.355
## ManufacturingProcess28 3.265
## ManufacturingProcess11 2.943
## ManufacturingProcess18 2.373
## ManufacturingProcess29 2.361
## BiologicalMaterial05 2.276
## ManufacturingProcess24 1.899
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
From the plot we can see that manufacturing process variables occupy the top splits of the tree, and the distribution of yield in the terminal nodes shows how those splits separate higher- and lower-yield batches. This is valuable information for informing plant processing decisions.
library(partykit)
chem_train <- cbind(trainY, trainX)
single_tree <- rpart(trainY ~ ., data = chem_train)
single_tree_party <- as.party(single_tree)
plot(single_tree_party)