library(mlbench)
library(mice)
library(randomForest)
library(caret)
library(ggplot2)
library(party)
library(gbm)
library(Cubist)
library(dplyr)
library(rpart)
library("partykit")
8.1. Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
(a) Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
We can see that V1 has the highest importance score, followed by V4 and V2, while V6 - V10 have scores near zero. Therefore, the random forest model did not significantly use the uninformative predictors.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
The importance score for V1 in the first model was 8.73. In the second model it drops to 5.69, a reduction of about 3 points, while the duplicate becomes a relatively important variable.
If we add another predictor that is highly correlated with V1, we would anticipate a further reduction in V1’s importance.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
#cor(simulated$duplicate2, simulated$V1)
model3 <- randomForest(y ~., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
As expected, V1’s importance score is reduced further, with the importance now shared between V1 and its two duplicates.
Conditional Inference Forest (cforest)
cf_model <- cforest(y ~., data = simulated)
varimp(cf_model)
## V1 V2 V3 V4 V5 V6
## 5.42152778 5.79075433 -0.01425384 5.96294311 1.87652947 -0.06822257
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.07696942 -0.17067028 0.02155362 -0.13376134 5.27935126 2.92224516
With the exception of the importance score for V3, the pattern is similar: V4 is the most important predictor and V6 - V10 are unimportant.
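By default varimp() reports unconditional permutation importance. A minimal sketch of the conditional scheme, which adjusts for correlated predictors such as the duplicates (assuming the varimp() loaded above accepts the documented conditional argument; the call is slower and its output is not shown here):
#Conditional permutation importance: each predictor is permuted within strata
#of correlated predictors, so V1 and its duplicates share less credit
varimp(cf_model, conditional = TRUE)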
Generalized Boosted Regression Modeling (GBM)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)
## var rel.inf
## V4 V4 30.8498158
## V2 V2 20.9639117
## V1 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5 V5 11.0492470
## V3 V3 8.5487055
## duplicate2 duplicate2 0.7137245
## V6 V6 0.2010299
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
V4 is ranked as the top predictor, followed by V2.
The uninformative predictors V6 - V10 are still ranked as the least important.
The GBM model indicates that V1 is much more important than duplicate1. This is because the trees in boosting are dependent on each other (each is built on the errors of the previous one), so they have correlated structures and tend to keep reusing the same predictors.
Cubist
#subset x columns
simulated_x <- subset(simulated, select = -c(y))
#run model and check variable importance
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
varImp(cubist_model)
## Overall
## V1 50.5
## V3 50.5
## V2 63.0
## duplicate2 20.5
## V4 46.0
## duplicate1 28.5
## V8 3.5
## V5 28.5
## V6 8.5
## V7 0.0
## V9 0.0
## V10 0.0
The Cubist rankings differ from the previous models: V2 receives the highest usage score, V1 and V3 are tied, the duplicates are ranked well below V1, and V7, V9, and V10 are not used at all.
8.2. Use a simulation to show tree bias with different granularities.
set.seed(555)
#specify features with different granularities
x1 <- sample(0:100000, 1000, replace = T)
x2 <- sample(0:1000, 1000, replace = T)
x3 <- sample(0:10, 1000, replace = T)
#Create y as a combination of the features plus a random noise term
y <- x1 + x2 + x3 + rnorm(1000)
#Merge features into df for RF
df <- data.frame(x1, x2, x3, y)
rf_sim_model <- randomForest(y ~., data = df, importance = TRUE, ntree = 1000)
varImp(rf_sim_model, scale = FALSE)
## Overall
## x1 1473915120.6
## x2 2102414.2
## x3 -584484.8
I generated random samples for three predictors (x1, x2, x3) with different granularities and created y as their sum plus a noise term. The output above shows that as we move from low to high granularity (x3 to x1), the variable importance score increases.
This suggests that the trees are biased toward predictors with greater granularity (more distinct values).
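Because y above is built directly from x1, x2, and x3, their differing scales also contribute to the importance scores. To isolate the granularity effect, here is a minimal single-tree sketch (assuming rpart, loaded above) in which neither predictor is related to the response at all:
#Repeatedly fit a regression tree to pure-noise data and record which predictor
#is chosen for the first split; a preference for x_fine reflects granularity
#bias rather than any real relationship with y
set.seed(555)
first_split <- replicate(200, {
  noise_df <- data.frame(
    x_fine   = sample(0:100000, 100, replace = TRUE), #many distinct values
    x_coarse = sample(0:10, 100, replace = TRUE),     #few distinct values
    y        = rnorm(100)                             #unrelated to both
  )
  fit <- rpart(y ~ ., data = noise_df)
  as.character(fit$frame$var[1]) #"<leaf>" when no split is made
})
table(first_split)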
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
knitr::include_graphics("Figure8.24.PNG")
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Gradient boosting trees make predictions by adding the predictions of successive trees. These trees are dependent on each other: each tree is built on the errors of the previous tree, so they have correlated structures. Suppose one tree splits on a particular feature frequently. The next tree, built on its errors, will likely split on that feature frequently as well. As a result, the same predictors are selected across many of the trees, increasing their contribution to the importance metric.
In both boosting models, a lower bagging fraction means that a smaller random subset of the training data is used to build each tree, while a lower shrinkage (learning rate) means slower learning: more iterations, each carrying a smaller impact. From this we can infer that the model on the right is “greedy” (a rapid learner) and likely prone to over-fitting: it takes larger steps down the gradient and, for the sake of learning faster, concentrates its importance on just a few predictors. Conversely, the model on the left samples a smaller subset of the data and learns more slowly; iterating in this way, and in series, reduces error and spreads importance across a larger set of predictors, giving a more robust, less error-prone model.
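A rough sketch of how the two parameter settings in Fig. 8.24 might be reproduced (assuming the solubility data that ships with AppliedPredictiveModeling, i.e. solTrainXtrans and solTrainY; the tree count and interaction depth are placeholder choices, not the book’s exact settings):
library(AppliedPredictiveModeling)
data(solubility) #provides solTrainXtrans and solTrainY, among others
#Left-hand panel: small bagging fraction and small learning rate
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 1,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
#Right-hand panel: large bagging fraction and large learning rate
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     n.trees = 100, interaction.depth = 1,
                     shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
#Compare how concentrated the relative influence is in each fit
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)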
(b) Which model do you think would be more predictive of other samples?
I think that the model on the left, the one with the slower learning rate and the smaller bagging fraction, would provide more accurate predictions on other samples. It learns slowly and reduces error across more iterations while giving less weight to a greater breadth of predictors, which should reduce over-fitting and improve accuracy on new data.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
The interaction depth specifies the maximum depth of each tree, i.e. the number of splits performed per tree. Increasing the interaction depth lets each tree use more predictors, spreading importance across a larger set of variables. Therefore, increasing the depth would reduce (flatten) the slope of the predictor-importance plot for either model.
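Continuing the hypothetical sketch above, this effect can be checked by refitting the right-hand configuration with a much deeper interaction depth and seeing how the relative influence spreads out:
#Deeper trees use more predictors per iteration, flattening the importance slope
gbm_right_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                          n.trees = 100, interaction.depth = 10,
                          shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbm_right_deep, plotit = FALSE), 10)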
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
#Load the data
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
#convert data to matrix
Data_matrix <- as.matrix(ChemicalManufacturingProcess)
#impute missing values
set.seed(333)
imputed_Data_matrix <- mice(Data_matrix, printFlag=F, method="pmm")
## Warning: Number of logged events: 675
data_comp <- complete(imputed_Data_matrix)
#remove highly correlated features
highly_correlated = findCorrelation(cor(data_comp), 0.80)
data_comp1 = data_comp[, -highly_correlated]
#center and scale
data_comp2 <- scale(data_comp1, center = TRUE, scale = TRUE)
#convert back to a data frame
cmp_df <- as.data.frame(data_comp2)
#train test split - 75/25 split
sample_size = round(nrow(cmp_df)*.75)
index <- sample(seq_len(nrow(cmp_df)), size = sample_size)
train <- cmp_df[index, ]
test <- cmp_df[-index, ]
#remove yield from sets to feed models
train.x <- train[-c(1)]
test.x <- test[-c(1)]
set.seed(333)
#fit the model
#(note: randomForest's tree-count argument is spelled ntree; the misspelled ntrees
#below is silently ignored, so the default of 500 trees is used, as the model
#summary confirms)
rf_model <- randomForest(train.x, train$Yield, importance = TRUE, ntrees = 1000)
rf_model
##
## Call:
## randomForest(x = train.x, y = train$Yield, importance = TRUE, ntrees = 1000)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 0.3845978
## % Var explained: 58.34
This might be a good model, but we still need to check the accuracy statistics on the test data.
rfPred <- predict(rf_model, newdata = test.x)
postResample(pred = rfPred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.6426172 0.7131018 0.4846558
The RF model gives a test-set RMSE of 0.64, an R-squared of 0.71, and an MAE of 0.48. Ideally we are looking for an RMSE of 0.2 - 0.5, an R-squared >= 0.75, and an MAE as close to 0 as possible.
gbm_model <- gbm.fit(train.x, train$Yield, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.9223 nan 0.0010 0.0007
## 2 0.9218 nan 0.0010 0.0005
## 3 0.9212 nan 0.0010 0.0004
## 4 0.9203 nan 0.0010 0.0007
## 5 0.9195 nan 0.0010 0.0006
## 6 0.9188 nan 0.0010 0.0006
## 7 0.9183 nan 0.0010 0.0005
## 8 0.9176 nan 0.0010 0.0007
## 9 0.9170 nan 0.0010 0.0004
## 10 0.9163 nan 0.0010 0.0004
## 20 0.9106 nan 0.0010 0.0005
## 40 0.8976 nan 0.0010 0.0006
## 60 0.8854 nan 0.0010 0.0005
## 80 0.8740 nan 0.0010 0.0006
## 100 0.8618 nan 0.0010 0.0006
#gbm_model
gbmPred <- predict(gbm_model, newdata = test.x)
## Using 100 trees...
postResample(pred = gbmPred, obs = test$Yield)
## RMSE Rsquared MAE
## 1.0666441 0.5085017 0.8376582
The boosting model gives a higher error and a lower R-squared than the random forest model. This is largely because gbm.fit was run with its defaults (only 100 trees and a shrinkage of 0.001, as the trace above shows), so the model is under-trained; tuning these parameters would likely improve it considerably.
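A minimal sketch of how the boosted model could be tuned by cross-validation (assuming caret, loaded above; the grid values are illustrative choices, not tuned results):
#Tune tree count, depth, and learning rate by 10-fold cross-validation
set.seed(333)
gbm_grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(train.x, train$Yield, method = "gbm", verbose = FALSE,
                   tuneGrid = gbm_grid,
                   trControl = trainControl(method = "cv", number = 10))
postResample(predict(gbm_tuned, newdata = test.x), test$Yield)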
cube_model <- cubist(train.x, train$Yield)
cube_model
##
## Call:
## cubist.default(x = train.x, y = train$Yield)
##
## Number of samples: 132
## Number of predictors: 40
##
## Number of committees: 1
## Number of rules: 3
cubePred <- predict(cube_model, newdata = test.x)
postResample(pred = cubePred, obs = test$Yield)
## RMSE Rsquared MAE
## 0.7480486 0.5561850 0.6085062
The Cubist model performs better than the boosting model but not as well as the random forest model.
(a) Which tree-based regression model gives the optimal resampling and test set performance?
The random forest model gave the best performance of the three tree-based models, with a test-set RMSE of 0.64, an R-squared of 0.71, and an MAE of 0.48.
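Strictly speaking, only test-set estimates were computed above. A sketch of how a resampled estimate for the random forest could be obtained for a like-for-like comparison (assuming caret, loaded above):
#10-fold cross-validated RMSE and R-squared for a random forest on the training split
set.seed(333)
rf_cv <- train(train.x, train$Yield, method = "rf",
               trControl = trainControl(method = "cv", number = 10))
rf_cv$results[rf_cv$results$mtry == rf_cv$bestTune$mtry, ]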
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The predictor importance can be found using the varImp function:
rf_imp <- varImp(rf_model)
rf_imp
## Overall
## BiologicalMaterial03 13.32003757
## BiologicalMaterial05 4.99087118
## BiologicalMaterial07 0.00000000
## BiologicalMaterial09 7.18392726
## BiologicalMaterial10 0.83173075
## BiologicalMaterial11 10.03737068
## ManufacturingProcess01 4.29511831
## ManufacturingProcess02 5.42691254
## ManufacturingProcess03 0.71751741
## ManufacturingProcess04 2.74021695
## ManufacturingProcess05 0.07638607
## ManufacturingProcess06 6.03235126
## ManufacturingProcess07 0.98781428
## ManufacturingProcess08 -0.53741719
## ManufacturingProcess09 11.65000301
## ManufacturingProcess10 2.79321544
## ManufacturingProcess11 7.40268624
## ManufacturingProcess12 1.85381976
## ManufacturingProcess13 10.37593848
## ManufacturingProcess16 1.73644414
## ManufacturingProcess17 9.08009916
## ManufacturingProcess19 1.43832058
## ManufacturingProcess20 5.17072854
## ManufacturingProcess21 1.05840688
## ManufacturingProcess22 3.42967163
## ManufacturingProcess23 2.76171735
## ManufacturingProcess24 3.89228496
## ManufacturingProcess26 4.35926201
## ManufacturingProcess28 7.29283763
## ManufacturingProcess30 4.55310702
## ManufacturingProcess32 22.95513243
## ManufacturingProcess34 2.38355098
## ManufacturingProcess35 -0.86557592
## ManufacturingProcess36 6.98750655
## ManufacturingProcess37 2.46940060
## ManufacturingProcess38 0.74832282
## ManufacturingProcess39 5.07388574
## ManufacturingProcess41 -0.01032035
## ManufacturingProcess43 5.16455119
## ManufacturingProcess44 2.99567777
ManufacturingProcess32 is the most important variable.
varImpPlot(rf_model)
From the variable importance plot above, we can conclude that 7 of the top 10 most important variables are manufacturing process variables. As was the case for our linear and nonlinear regression models, the process variables dominate the list.
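To make the top-10 list explicit, a quick base-R sketch using the rf_imp object computed above:
#Sort the importance scores and keep the ten largest
head(rf_imp[order(rf_imp$Overall, decreasing = TRUE), , drop = FALSE], 10)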
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
#Train single tree model
rpart_tree <- rpart(Yield ~., data = train)
#Produce tree plot
rpart_tree2 <- as.party(rpart_tree)
plot(rpart_tree2)
The tree above provides more information about the biological and process predictors and their relationship with yield. We can see which variables sit highest in the tree (higher up means greater importance) as well as the cutoff values at which the branches split. We can also determine the specific level at which a process variable should be set to increase yield. Overall, I believe the past three assignments have reiterated that focusing on the highest-ranked manufacturing processes (32, 9, 11, 17 from the chart above) would lead to the greatest net increase in Yield.