Chapter 8 Regression Trees and Rule-Based Models
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson.
8.1. Recreate the simulated data from Exercise 7.2:
library(AppliedPredictiveModeling)
library(mlbench)
library(caret)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1 |> dplyr::arrange(desc(Overall))
varImpPlot(model1)
Did the random forest model significantly use the uninformative predictors (V6 – V10)? No. The importance scores for V6–V10 are close to zero, so the random forest did not use the uninformative predictors to any meaningful degree.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated1 <- simulated
simulated1$duplicate1 <- simulated1$V1 + rnorm(200) * .1
cor(simulated1$duplicate1, simulated1$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated1, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |> dplyr::arrange(desc(Overall))
varImpPlot(model2)
#Add another predictor that is also highly correlated with V1
simulated1$duplicate2 <- simulated1$V1 + rnorm(200) * .5
model3 <- randomForest(y ~ ., data = simulated1, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |> dplyr::arrange(desc(Overall))
varImpPlot(model3)
Yes, the importance score for V1 decreases once the highly correlated duplicate1 is added, because the importance is now shared between V1 and its near-duplicate. In this run V1's score increases somewhat after duplicate2 is added, but the credit for V1's signal is still split across the three correlated predictors.
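A quick sketch, using the unscaled importance data frames computed above, to see how the credit for V1's signal is shared across the correlated copies:
#Compare the combined score of V1 and its duplicates with V1's score in the original model
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])
rfImp1["V1", "Overall"]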
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
model_cforest <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 1000))
rfImp4 <- varimp(model_cforest)
sort(rfImp4, decreasing = TRUE)
##           V1           V4           V2           V5           V7           V3 
##  9.004235181  8.211985992  6.821093759  1.835499278  0.034696633  0.026593276 
##           V6          V10           V9           V8 
## -0.006475309 -0.018211637 -0.021021365 -0.051366993
V1 is the most important predictor, with V4 and V2 rounding out the top three, while the uninformative predictors V6–V10 again have importance near zero. This is the same pattern shown by the traditional random forest model.
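The conditional importance described in Strobl et al. (2007) can be obtained by toggling the conditional argument of varimp; a minimal sketch (note this call can be slow with 1,000 trees):
#Conditional importance adjusts for correlations among predictors
rfImp4_cond <- varimp(model_cforest, conditional = TRUE)
sort(rfImp4_cond, decreasing = TRUE)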
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
#Try boosted trees
library(gbm)
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
best.iter <- gbm.perf(gbmModel, method = "OOB")
summary(gbmModel)
#Try Cubist
library(Cubist)
cubistMod <- cubist(simulated[,-11], simulated$y)
summary(cubistMod)
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Mon Apr 10 23:14:44 2023
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (11 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.224012
## Relative |error| 0.55
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
##
##
## Time: 0.0 secs
The GBM relative-influence summary gives non-zero influence to only seven predictors; V8, V9, and V10 contribute essentially nothing. The Cubist model is even more compact: it produces a single rule whose linear model uses only V4, V1, V2, and V5. Both models therefore show the same broad pattern as the random forest, relying on the informative predictors and largely ignoring the uninformative ones.
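Whether the correlated-duplicate pattern from parts (b) and (c) also appears for these models can be checked by refitting them on simulated1; a sketch (hypothetical code, not part of the original output, using caret's varImp method for the Cubist fit):
#Refit boosted trees and Cubist on simulated1 (which contains the correlated
#duplicate1/duplicate2 columns) to see whether V1's importance is diluted here too
gbmModel2 <- gbm(y ~ ., data = simulated1, distribution = "gaussian", n.trees = 1000)
summary(gbmModel2, plotit = FALSE)
cubistMod2 <- cubist(simulated1[, setdiff(names(simulated1), "y")], simulated1$y)
varImp(cubistMod2)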
8.2. Use a simulation to show tree bias with different granularities.
# Generate sample data in which x1 is a coarse two-level predictor and x2 is a
# continuous predictor with many distinct values; both contribute to y
set.seed(3333)
n  <- 200
x1 <- rep(1:2, each = n/2)
x2 <- rnorm(n, mean = 0, sd = 15)
y  <- x1 + x2 + rnorm(n, 0, 1)
train_data <- data.frame(y = y, x1 = x1, x2 = x2)
rf_modl <- randomForest(y ~ ., data = train_data)
varImp(rf_modl)
In Applied Predictive Modeling, Kuhn and Johnson state that predictors with a higher number of distinct values are favored over coarser, less granular predictors: "The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all."
Because x2 takes many more distinct values than the two-level x1 (and also has a much larger variance), the random forest assigns x2 a much higher importance than x1, consistent with the bias described above.
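A complementary sketch (hypothetical code added here) isolates the granularity effect itself: neither predictor is related to the response, yet the fine-grained x2 is chosen for the root split far more often than the two-level x1, which is exactly the bias described in the quoted passage.
library(rpart)
set.seed(101)
n  <- 250
x1 <- rep(1:2, length.out = n)   #coarse predictor: only 2 distinct values
x2 <- rnorm(n)                   #fine-grained predictor: ~n distinct values
#Record which predictor wins the root split over repeated pure-noise responses
firstSplit <- replicate(200, {
  d   <- data.frame(y = rnorm(n), x1 = x1, x2 = x2)   #y is unrelated to x1 and x2
  fit <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = 1, cp = 0))
  if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA
})
table(firstSplit)                #x2 is selected far more often than x1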
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
Also from the book, “this is due to the fact that the trees from boosting are dependent on each other and hence will have correlated structures as the method follows by the gradient. Therefore many of the same predictors will be selected across the trees, increasing their contribution to the importance metric.”
With the small bagging fraction on the left, each tree is fit to a small random subsample of the data, so different trees end up relying on different predictors; the small learning rate means each tree contributes only a small correction, so many trees are required and importance accumulates across a wider set of predictors.
As both the bagging fraction and learning rate increase, each tree sees most of the data and takes large steps along the gradient, so fewer trees are needed and the same few strong predictors are selected repeatedly, concentrating the importance on just those predictors.
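A sketch of how this could be checked (hypothetical code added here, assuming the solubility data from the AppliedPredictiveModeling package that Fig. 8.24 is based on; other gbm settings are left at their defaults):
#Fit boosted trees at the two extremes of bagging fraction and learning rate and
#compare how concentrated the relative influence is
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)                                   #provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLow, plotit = FALSE), 10)    #influence expected to spread over many predictors
head(summary(gbmHigh, plotit = FALSE), 10)   #influence expected to concentrate on a few predictors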
(b) Which model do you think would be more predictive of other samples? The model with the small bagging fraction and learning rate (the left-hand model) is more likely to predict new samples well: the slower, more randomized fitting acts as regularization and spreads the model across more predictors, whereas the right-hand model risks overfitting by relying heavily on just a few predictors.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? As the interaction depth (tree depth) increases, each tree splits on more predictors, so more of the less important predictors pick up some importance. This spreads the importance more evenly and therefore decreases (flattens) the slope of the predictor importance plot for either model.
8.7. Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
(a) Which tree-based regression model gives the optimal resampling and test set performance?
#Prepare the dataframe
data(ChemicalManufacturingProcess)
ChemicalManufacturingProcess<-as.data.frame(ChemicalManufacturingProcess)
#The impute package is not available here, so use caret's preProcess with k-nearest neighbours (knnImpute) instead
preProcValues <- preProcess(ChemicalManufacturingProcess[,-1], method = "knnImpute", k = 5, knnSummary = mean)
impute_CMP <- predict(preProcValues, ChemicalManufacturingProcess[,-1], na.action = na.pass)
#knnImpute centers and scales the predictors, so de-normalize them to recover the original scale
procNames <- data.frame(col = names(preProcValues$mean), mean = preProcValues$mean, sd = preProcValues$std)
for(i in procNames$col){
  impute_CMP[i] <- impute_CMP[i] * preProcValues$std[i] + preProcValues$mean[i]
}
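A quick sanity check (minimal sketch) that the imputation and back-transformation left no missing values:
sum(is.na(impute_CMP))   #expected to be 0 after imputation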
set.seed(2744)
fold2 <- ChemicalManufacturingProcess$Yield |> createDataPartition(p = 0.9, list = FALSE, times = 1)
#Create training and testing set
bio_manu_train <- impute_CMP[fold2,]
bio_manu_test <- impute_CMP[-fold2,]
yield_train <- ChemicalManufacturingProcess[fold2, 1]
yield_test <- ChemicalManufacturingProcess[-fold2, 1]
set.seed(9999)
#Try Single Tree Model
library(rpart)
rpartTune <- train(bio_manu_train, yield_train, method = "rpart2", tuneLength = 10, trControl = trainControl(method = "cv"))
rpartPred <- predict(rpartTune, newdata = bio_manu_test)
#Try Bagged Tree Model
library(ipred)
baggedTree <- ipredbagg(y=yield_train, X=bio_manu_train)
baggedTreePred<-predict(baggedTree, newdata = bio_manu_test)
#Try Random Forest Model
rfModel <- randomForest(bio_manu_train, yield_train, importance = TRUE, ntree = 1000)
rfPred <- predict(rfModel, newdata = bio_manu_test)
#Try Boosted Tree Model
gbmGrid1 <- expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000, by = 100), .shrinkage = c(0.01, 0.1), .n.minobsinnode=10)
gbmTune1 <- train(bio_manu_train, yield_train, method = "gbm", tuneGrid = gbmGrid1, verbose=FALSE)
gbmPred <- predict(gbmTune1, newdata = bio_manu_test)
#Try Cubist model
cubistTuned <- train(bio_manu_train, yield_train, method = "cubist")
cubistPred <- predict(cubistTuned, newdata = bio_manu_test)
#Create a summary for all models:
rbind(
"Single Tree" = postResample(pred = rpartPred, obs = yield_test),
"Bagged Tree" = postResample(pred = baggedTreePred, obs = yield_test),
"Random Forest" = postResample(pred = rfPred, obs = yield_test),
"Boosted Tree" = postResample(pred = gbmPred, obs = yield_test),
"Cubist" = postResample(pred = cubistPred, obs = yield_test)
)
##                    RMSE  Rsquared       MAE
## Single Tree   1.3013726 0.4789076 0.9911066
## Bagged Tree   1.1492040 0.5900280 0.8543055
## Random Forest 1.0991774 0.6656708 0.8210386
## Boosted Tree  1.1124656 0.6567608 0.8787618
## Cubist        0.9929228 0.7227200 0.7261956
Cubist gives the lowest RMSE and MAE and the highest R2 among these models on the test set, so it is the best-performing tree-based model here.
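Part (a) also asks about resampling performance, which the test-set table above does not show. A sketch of how the cross-validated results could be compared for the caret-trained models (the bagged tree and random forest were fit outside train, so they are omitted); the shared fold index introduced below is an assumption added here so that resamples() compares like with like:
set.seed(9999)
cvIndex <- createFolds(yield_train, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = cvIndex)
rpartCV  <- train(bio_manu_train, yield_train, method = "rpart2", tuneLength = 10, trControl = ctrl)
gbmCV    <- train(bio_manu_train, yield_train, method = "gbm", tuneGrid = gbmGrid1, trControl = ctrl, verbose = FALSE)
cubistCV <- train(bio_manu_train, yield_train, method = "cubist", trControl = ctrl)
summary(resamples(list(SingleTree = rpartCV, BoostedTree = gbmCV, Cubist = cubistCV)))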
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
CubistImp<-varImp(cubistTuned)
#Refit the previous linear model (PLS) and non-linear model (SVM with a radial kernel) to compare variable importance
svmRTuned2 <- train(x=bio_manu_train, y=yield_train, method = "svmRadial", preProc =c("center", "scale"), tuneLength = 14, trControl = trainControl(method = "cv"))
svmRPred2<-predict(svmRTuned2, bio_manu_test)
SVMRImp<-varImp(svmRTuned2)
bio_manu_pls <- train(x = bio_manu_train, y = yield_train, method = "pls", tuneLength = 20, trControl = trainControl(method = "cv"), preProcess = c("center", "scale"))
PLSImp<-varImp(bio_manu_pls)
#Plot all Important data in parallel
p1<-plot(PLSImp, top=10, main='Linear Model: PLS')
p2<-plot(SVMRImp, top=10, main='Non-linear Model: SVM')
p3<-plot(CubistImp, top=10, main='Tree-based regression Model: Cubist')
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)
The top three most important predictors in the Cubist model are ManufacturingProcess32, ManufacturingProcess17, and BiologicalMaterial06, and the manufacturing process variables dominate the list with 7 of the top 10. All three models rank ManufacturingProcess32 as the most important predictor, and in each of them the process variables dominate the top 10, although the remaining top-10 variables differ from model to model.
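A sketch for checking the overlap described above, assuming each varImp object stores its scores in the $importance data frame with an Overall column (the default for these regression models):
#Extract the top-10 predictor names from each model for a side-by-side comparison
top10 <- function(imp) rownames(imp$importance)[order(imp$importance$Overall, decreasing = TRUE)][1:10]
list(PLS = top10(PLSImp), SVM = top10(SVMRImp), Cubist = top10(CubistImp))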
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
library(rpart.plot)
trainData1 <- cbind(bio_manu_train, yield_train)
rpartTree <- rpart(yield_train ~ ., data = trainData1)
rpart.plot(rpartTree)
Yes. The tree makes the fitted relationships concrete: it shows the actual split thresholds on the most influential process and biological predictors and how the mean yield shifts from node to node, which complements the importance rankings with information about the direction and cut-points of each effect.
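Since partykit is already loaded above, the distribution of yield within each terminal node can also be displayed directly as boxplots, which matches what part (c) asks for more closely; a minimal sketch:
#Convert the rpart fit to a party object and plot yield boxplots in the terminal nodes
plot(as.party(rpartTree))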