Chapter 8 - Regression Trees and Rule-Based Models



8.1. Recreate the simulated data from Exercise 7.2:

> library(mlbench)
> set.seed(200)
> simulated <- mlbench.friedman1(200, sd = 1)
> simulated <- cbind(simulated$x, simulated$y)
> simulated <- as.data.frame(simulated)
> colnames(simulated)[ncol(simulated)] <- "y"
  a. Fit a random forest model to all of the predictors, then estimate the variable importance scores:
> library(randomForest)
> library(caret)
> model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
> rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

  b. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
> simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
> cor(simulated$duplicate1, simulated$V1)

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

  c. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

  d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?



library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a.

model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
(rfImp <- varImp(model1, scale = FALSE))
##         Overall
## V1   8.83890885
## V2   6.49023056
## V3   0.67583163
## V4   7.58822553
## V5   2.27426009
## V6   0.17436781
## V7   0.15136583
## V8  -0.03078937
## V9  -0.02989832
## V10 -0.08529218
  • The random forest does not make meaningful use of the uninformative predictors: the importance scores for V6-V10 are all essentially zero (some slightly negative).
  • V1-V5 received much higher importance scores than V6-V10, with V1, V2, and V4 dominating and V3 contributing comparatively little.
  • As a rough check (computed in the sketch below), the absolute importance scores for V1-V5 sum to roughly 25, while those for V6-V10 sum to about 0.6, a large relative difference.
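
A quick check of that rough estimate (a small sketch using the rfImp object computed above; not part of the original output):

# Sum the absolute importance scores for the informative (V1-V5) and
# uninformative (V6-V10) groups of predictors.
informative   <- sum(abs(rfImp[paste0("V", 1:5),  "Overall"]))
uninformative <- sum(abs(rfImp[paste0("V", 6:10), "Overall"]))
c(informative = informative, uninformative = uninformative)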

b.

library(data.table)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
dt <- setDT(varImp(model1, scale = FALSE), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-Overall)]
##      rn     Overall
##  1:  V1  8.83890885
##  2:  V4  7.58822553
##  3:  V2  6.49023056
##  4:  V5  2.27426009
##  5:  V3  0.67583163
##  6:  V6  0.17436781
##  7:  V7  0.15136583
##  8:  V9 -0.02989832
##  9:  V8 -0.03078937
## 10: V10 -0.08529218
  • When the forest is refit with duplicate1 included (a sketch follows; the full refit with a second duplicate appears below), the importance score for V1 decreases.
  • Because duplicate1 carries nearly the same information as V1, the two predictors split the credit between them, diluting the signal attributed to V1 alone.
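
A hedged sketch (not part of the original output) of that refit, with only duplicate1 added so far:

# Refit the random forest now that duplicate1 is in `simulated`
# (duplicate2 has not been added yet) and inspect the affected scores.
model1b <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1b <- varImp(model1b, scale = FALSE)
rfImp1b[c("V1", "duplicate1"), , drop = FALSE]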
library(data.table)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .15
cor(simulated$duplicate2, simulated$V1)
## [1] 0.8945605
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp <- varImp(model2, scale = FALSE)

dt <- setDT(varImp(model2, scale = FALSE), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-Overall)]
##             rn     Overall
##  1:         V4  7.21776316
##  2:         V2  6.45956335
##  3:         V1  5.65399070
##  4: duplicate1  2.65416045
##  5:         V5  1.97242335
##  6: duplicate2  1.58972104
##  7:         V3  0.60371103
##  8:         V6  0.12652966
##  9:        V10 -0.01073958
## 10:         V7 -0.01201493
## 11:         V9 -0.05637654
## 12:         V8 -0.09943217
  • V1's importance score has decreased now that highly correlated predictors have been added; the two duplicates absorb part of the importance that previously went to V1.
  • We can conclude that a predictor's importance score will be diluted when highly correlated predictors are present in the model.

c.

library(data.table)
library(party)   # cforest() and varimp()
library(dplyr)   # provides the pipe used below
fit.cforest <- cforest(y ~ ., data = simulated)

dt <- setDT(varimp(fit.cforest, conditional = TRUE) %>% data.frame(), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-.)]
##             rn           .
##  1:         V4  5.46562157
##  2:         V2  4.66059632
##  3:         V1  3.20446849
##  4:         V5  1.39716392
##  5: duplicate1  1.24840582
##  6: duplicate2  1.09574887
##  7:         V6  0.02341780
##  8:         V3 -0.09675796
##  9:         V9 -0.15673783
## 10:         V8 -0.19956241
## 11:        V10 -0.20368319
## 12:         V7 -0.24295729
dt <- setDT(varimp(fit.cforest, conditional = FALSE) %>% data.frame(), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-.)]
##             rn           .
##  1:         V4  6.34961451
##  2:         V1  6.26481997
##  3:         V2  5.37851273
##  4: duplicate1  3.52874469
##  5: duplicate2  3.20044071
##  6:         V5  2.06735420
##  7:         V6  0.13700519
##  8:         V3  0.10642765
##  9:         V7  0.06689043
## 10:         V9  0.04811456
## 11:         V8 -0.02418668
## 12:        V10 -0.22198342
  • The conditional inference forest shows a similar pattern: the informative predictors V1, V2, V4, and V5 (plus the correlated duplicates) receive the highest importance, while V6-V10 stay near zero. With conditional = TRUE, the scores for V1 and its duplicates are noticeably shrunk relative to the unconditional measure, reflecting the adjustment for correlated predictors.

d.

library(Cubist)
fit.cubist <- cubist(simulated %>% select(-y), simulated$y, committees = 4)
fit.cubist
## 
## Call:
## cubist.default(x = simulated %>% select(-y), y = simulated$y, committees
##  = 4)
## 
## Number of samples: 200 
## Number of predictors: 12 
## 
## Number of committees: 4 
## Number of rules per committee: 1, 4, 1, 4
dt <- setDT(varImp(fit.cubist) %>% data.frame(), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-Overall)]
##             rn Overall
##  1:         V1    67.0
##  2:         V2    54.5
##  3:         V4    50.0
##  4:         V5    46.5
##  5:         V3    43.0
##  6: duplicate2    14.5
##  7:         V6     2.5
##  8: duplicate1     2.5
##  9:         V7     0.0
## 10:         V8     0.0
## 11:         V9     0.0
## 12:        V10     0.0
  • The Cubist model behaves like the earlier models in that V1-V5 receive much higher importance than V6-V10. One difference is that V3 is now clearly grouped with the informative predictors, while the duplicates and the remaining uninformative predictors receive little or no importance.


8.2. Use a simulation to show tree bias with different granularities.


library(caret)        # twoClassSim()
library(dplyr)
library(rpart)
library(rpart.plot)

data.set <- twoClassSim(300, noiseVars = 8, corrVar = 6, corrValue = 0.8) %>%
            mutate(TwoFactor1 = as.factor(round(TwoFactor1, 0)),
                   TwoFactor2 = as.factor(round(TwoFactor2, 0)))

r.fit <- rpart(Linear01 ~ ., data = data.set)
rpart.plot(r.fit)

dt <- setDT(varImp(r.fit), keep.rownames = TRUE)[]
options(scipen=99)
dt[order(-Overall)]
##             rn    Overall
##  1:     Noise5 1.18482825
##  2:      Corr4 1.03593205
##  3:   Linear06 1.02184898
##  4: Nonlinear2 0.85142287
##  5:     Noise8 0.78952667
##  6:   Linear08 0.76693139
##  7: TwoFactor1 0.69097583
##  8:   Linear04 0.60181778
##  9:      Corr3 0.57738032
## 10:   Linear07 0.51707612
## 11:     Noise4 0.48274395
## 12:   Linear03 0.48268985
## 13:     Noise2 0.47560469
## 14:      Corr6 0.47212661
## 15:      Corr2 0.44345254
## 16:     Noise7 0.44232603
## 17:   Linear09 0.35801898
## 18:      Corr1 0.35266603
## 19:   Linear05 0.31854720
## 20:     Noise6 0.28661098
## 21:     Noise3 0.27578621
## 22: Nonlinear1 0.24194755
## 23:   Linear10 0.22428898
## 24: TwoFactor2 0.22242313
## 25: Nonlinear3 0.10469571
## 26:     Noise1 0.08518753
## 27:      Corr5 0.05646623
## 28:   Linear02 0.00000000
## 29:      Class 0.00000000
##             rn    Overall

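The low-granularity predictors (the rounded factors TwoFactor1 and TwoFactor2) end up with lower importance than many of the continuous noise and correlated variables, which is consistent with the selection bias of CART-style trees toward predictors with more distinct values. A minimal standalone sketch below (not part of the original output; the variable names and settings are illustrative) makes the bias more direct: three predictors carry no signal at all but differ only in granularity, and over repeated simulations the tree splits far more often on the higher-granularity ones.

library(rpart)

set.seed(100)
first_split <- replicate(200, {
  n  <- 200
  df <- data.frame(
    y  = rnorm(n),                           # response unrelated to any predictor
    X1 = sample(1:2,  n, replace = TRUE),    # 2 distinct values
    X2 = sample(1:10, n, replace = TRUE),    # 10 distinct values
    X3 = rnorm(n)                            # ~200 distinct values
  )
  # Force exactly one split so we can record which predictor the tree chooses
  fit <- rpart(y ~ ., data = df,
               control = rpart.control(cp = 0, maxdepth = 1, minsplit = 20))
  as.character(fit$frame$var[1])
})

table(first_split)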

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  a. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
  b. Which model do you think would be more predictive of other samples?
  c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

a.

  • The model on the right focuses its importance on just the first few predictors because a higher learning rate makes each tree contribute more strongly to the fit; the model commits quickly to the dominant predictors and ends up using fewer of them.
  • The high bagging fraction reinforces this: when each tree sees most of the training data, the same strong predictors are selected tree after tree, so importance concentrates on a few variables. A hedged sketch of how these two settings could be compared on the solubility data follows.
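
A minimal sketch of how the two extreme settings might be reproduced on the solubility data with gbm (assuming the gbm package and the solubility data from AppliedPredictiveModeling; settings other than the shrinkage and bag fraction are illustrative, not the book's exact setup):

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

sol <- cbind(solTrainXtrans, Solubility = solTrainY)

# Both tuning parameters set to 0.1 (left panel of Fig. 8.24) ...
gbm_low  <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                interaction.depth = 1)
# ... and both set to 0.9 (right panel)
gbm_high <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 1)

# Relative influence: expected to be spread out for the low setting and
# concentrated on a few predictors for the high setting, per Fig. 8.24
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)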

b.

  • The model on the left should be more predictive of other samples: pushing both the learning rate and the bagging fraction to 0.9 makes the fit greedier and more prone to overfitting the training data.
  • The left-hand model spreads its fit across more predictors, while the right-hand model relies mostly on the top few, so the broader model is more likely to generalize. A sketch of checking this on the solubility test set follows.
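
Continuing the hypothetical sketch above (again, not the original analysis), the two models could be scored on the held-out solubility test set:

# Score both models from the previous sketch on the solubility test set.
pred_low  <- predict(gbm_low,  solTestXtrans, n.trees = 100)
pred_high <- predict(gbm_high, solTestXtrans, n.trees = 100)
caret::postResample(pred_low,  solTestY)
caret::postResample(pred_high, solTestY)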

c.

  • Increasing the interaction depth allows deeper trees, which pull more predictors into the splits, so importance would be spread across more predictors and the slope of the importance curve would flatten for both models.
  • The model on the right would likely change the most, since the additional depth would shift some weight toward the predictors that currently lag in importance. A quick sketch of refitting the high-parameter model with a larger interaction depth follows.
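
A short, hypothetical follow-up to the earlier sketch: refit the high-shrinkage, high-bagging-fraction model with a larger interaction depth and see whether the relative influence tails off more slowly:

# Same settings as gbm_high, but with deeper trees (interaction.depth = 10).
gbm_deep <- gbm(Solubility ~ ., data = sol, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)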


8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

  a. Which tree-based regression model gives the optimal resampling and test set performance?
  b. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
  c. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

a.

We reuse the imputation and data splitting code from Exercise 6.3 below and then train several tree-based models to compare the results:

library(AppliedPredictiveModeling)
library(missForest)
library(caret)
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess

# Impute missing values with random-forest-based imputation
df_imp1 <- missForest(df)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
df_imp <- df_imp1$ximp

data <- df_imp[, 2:58]     # predictors
target <- df_imp[, 1]      # Yield

training <- createDataPartition(target, p = 0.75)
predictor_training <- data[training$Resample1, ]
target_training <- target[training$Resample1]

predictor_testing <- data[-training$Resample1, ]
target_testing <- target[-training$Resample1]
ctrl <- trainControl(method = "cv", number = 10)

Use a single tree model and evaluate the results:

Tune:

rt_grid <- expand.grid(maxdepth = seq(1, 10, by = 1))
rt_Tune <- train(x = predictor_training, y = target_training,
                 method = "rpart2", metric = "Rsquared",
                 tuneGrid = rt_grid, trControl = ctrl)

Predict:

rt_pred = predict(rt_Tune, predictor_testing)
postResample(pred = rt_pred, obs = target_testing)
##      RMSE  Rsquared       MAE 
## 1.5289341 0.4102548 1.2064965

Use a random forest model and evaluate the results:

Tune:

rf_grid <- expand.grid(mtry = seq(2, 38, by = 3))
rf_Tune <- train(x = predictor_training, y = target_training,
                 method = "rf", tuneGrid = rf_grid, metric = "Rsquared",
                 importance = TRUE, trControl = ctrl)

Predict:

rf_pred = predict(rf_Tune, predictor_testing)
postResample(pred = rf_pred, obs = target_testing)

Use a Cubist model and evaluate the results:

Tune:

cube_grid <- expand.grid(committees = c(1, 5, 10, 20, 50), neighbors = c(0, 1, 3, 5))

cube_Tune <- train(x = predictor_training, y = target_training,
                   method = "cubist", metric = "Rsquared",
                   tuneGrid = cube_grid, trControl = ctrl)

Predict:

cube_pred = predict(cube_Tune, predictor_testing)
postResample(pred = cube_pred, obs = target_testing)
##      RMSE  Rsquared       MAE 
## 0.8414842 0.8466099 0.6354866
  • The Cubist model has the best test set RMSE; it is also an improvement over the MARS and PLS scores we obtained in the earlier exercises. A sketch of a resampling comparison across the three tree-based models follows.
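
For the resampling side of the question, a small sketch (not part of the original output) using caret::resamples; this assumes the three train() calls were run on comparable CV folds (for example by fixing the seed or passing an index to trainControl):

# Collect the 10-fold CV results from the three tuned models and summarize them.
res <- resamples(list(SingleTree = rt_Tune, RandomForest = rf_Tune, Cubist = cube_Tune))
summary(res)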

b.

Review the predictor importance of the Cubist model:

plot(varImp(cube_Tune), top=10, scales = list(y = list(cex = 0.8)))

  • ManufacturingProcess32 tops the list, which is not a surprise: we already established in the earlier exercises that it is the most important predictor. It is followed by ManufacturingProcess17.
  • One thing to notice is that the Cubist model relies heavily on its top two predictors, in contrast to PLS; we saw similar behavior with MARS.
  • One surprise is that Cubist leans on ManufacturingProcess17 rather than ManufacturingProcess09 as MARS did, though we saw similar usage with PLS and other models. A small sketch tabulating the top 10 predictors follows.
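
A small sketch (not part of the original output) that lists the top 10 Cubist predictors and counts how many are biological measurements versus manufacturing process variables:

# Rank the Cubist importance scores, keep the top 10, and count the two groups.
imp <- varImp(cube_Tune)$importance
top10 <- head(imp[order(-imp$Overall), , drop = FALSE], 10)
top10
table(ifelse(grepl("Biological", rownames(top10)), "Biological", "Process"))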

c.

Plot the optimal single tree:

library(partykit)   # as.party() converts the rpart model for plotting
plot(as.party(rt_Tune$finalModel), gp = gpar(fontsize = 11))

  • As expected, the top predictors in the tree are manufacturing process variables.
  • ManufacturingProcess32 sits at the top of the tree, and only a few biological predictors affect the yield in any meaningful way.
  • At this point we have established that the manufacturing process variables have a much larger impact on yield than the biological measurements.
  • The Cubist model is the latest and strongest confirmation of that, so any further study should focus on finding optimal settings for the manufacturing processes.