8.1. Recreate the simulated data from Exercise 7.2:

library(mlbench)
library(dplyr)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

rfImp1 %>% arrange(desc(Overall))

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

As shown above, the random forest made little use of the uninformative predictors: V6-V10 all received importance scores at or near zero, well below those of the informative predictors.
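As a quick visual check, randomForest's built-in importance plot can be used (a minimal sketch; type = 1 selects the permutation-based measure reported above):

varImpPlot(model1, type = 1, scale = FALSE)  # V6-V10 should sit near zero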

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

set.seed(233)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
[1] 0.9501363

Fit another random forest model to these data. Did the importance score for V1 change?

model2 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)

rfImp2 %>% arrange(desc(Overall))

The importance of V1 did change: it decreased by a sizable margin. The duplicate itself is not as important as the original, though.
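A small sketch comparing the unscaled importances before and after the correlated predictor was added (reusing the rfImp1 and rfImp2 objects from above):

# V1's importance before vs. after, alongside the importance of duplicate1
data.frame(V1_before  = rfImp1["V1", "Overall"],
           V1_after   = rfImp2["V1", "Overall"],
           duplicate1 = rfImp2["duplicate1", "Overall"])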

What happens when you add another predictor that is also highly correlated with V1?

set.seed(233)
simulated$duplicate2 <- simulated$V1 + rnorm(200)  * .1
cor(simulated$duplicate2, simulated$V1)
[1] 0.9501363
model3 <- randomForest(y ~ ., data = simulated,
                      importance = TRUE,
                      ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)

rfImp3 %>% arrange(desc(Overall))

The same thing happened as when the first duplicate was added: the importance of V1 decreased again.
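As a sketch, the drop can be tracked across the three fits:

# Unscaled importance of V1 as correlated copies are added
sapply(list(no_duplicate   = rfImp1,
            one_duplicate  = rfImp2,
            two_duplicates = rfImp3),
       function(imp) imp["V1", "Overall"])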

(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

library(party)

bagCtrl <- cforest_control(mtry = ncol(simulated) - 1)
baggedTree <- cforest(y ~ ., data = simulated, controls = bagCtrl)
set.seed(233)
condImp <- varimp(baggedTree,conditional = TRUE)

nonCondImp <- varimp(baggedTree, conditional = FALSE)
library(tibble)
library(DataExplorer)
# plot_bar returns ggplot objects, so a base-graphics layout via par(mfrow = ...) has no effect here
as.data.frame(condImp) %>% rownames_to_column('Var') %>% plot_bar(with = "condImp")

as.data.frame(nonCondImp) %>% rownames_to_column('Var') %>% plot_bar(with = "nonCondImp")

We can see above that both the conditional and the unconditional cforest importances handle the duplicates far better than the random forest model did, driving the second duplicate's importance essentially to zero. The conditional measure handles the situation best overall, recognizing the redundancy and lowering the importance of both V1 and duplicate1.
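The same comparison can be read from a table rather than the bar charts (a sketch; varimp returns both vectors in the same variable order for a given fit):

# Conditional vs. unconditional importance, sorted by the conditional measure
data.frame(Var = names(condImp),
           conditional   = condImp,
           unconditional = nonCondImp) %>%
  arrange(desc(conditional))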

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

set.seed(100)
gbmModel <- train(x = select(simulated, !'y'), y = simulated$y, method = "gbm",
                  tuneGrid = gbmGrid,
                  verbose = FALSE)
condImpGbm <- varImp(gbmModel)
condImpGbm$importance %>% rownames_to_column('Var') %>%   plot_bar(with="Overall")

library(Cubist)

cubistMod <- cubist(select(simulated,!'y'),simulated$y)
condImpCub <- varImp(cubistMod)
condImpCub %>% rownames_to_column('Var') %>%   plot_bar(with="Overall")

Since caret's varImp does not take a conditional argument, I computed a single importance measure for each of the gbm and Cubist models. Cubist has the most interesting result: it shows exactly what we would expect, with the five informative predictors at the top and the rest essentially worthless. The gbm importances look very similar to the conditional cforest output. Overall, Cubist appears to have handled the correlated predictors best.
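For a more direct comparison than the bar charts, the top of each ranking can be printed side by side (a sketch reusing the importance objects above):

# Top 10 predictors by importance for the boosted tree and Cubist fits
condImpGbm$importance %>% rownames_to_column('Var') %>% arrange(desc(Overall)) %>% head(10)
condImpCub %>% rownames_to_column('Var') %>% arrange(desc(Overall)) %>% head(10)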

8.2. Use a simulation to show tree bias with different granularities.

In the simulation below, y is built from v1, a continuous predictor with many distinct values, and v3, a Poisson count that takes only a handful of distinct values; v2 is pure noise. As the importance scores show, the model ends up caring almost exclusively about v1, the variable with the large number of distinct values.

set.seed(123)
v1 <- runif(1000, 1, 1000)   # continuous: roughly 1000 distinct values
v2 <- rnorm(1000, 2, 300)    # continuous noise, not used to construct y
v3 <- rpois(1000, 2.4)       # count predictor: only a handful of distinct values

y   <- v1 + 2 * v3
sim <- data.frame(y, v1, v2, v3)

simModel <- randomForest(y ~ ., data = sim,
                         importance = TRUE,
                         ntree = 1000)
varImp(simModel)
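The bias follows the number of potential split points each predictor offers; a quick sketch of the granularity difference:

# Number of distinct values per predictor: v1 and v2 are effectively continuous,
# while v3 takes only a few integer values
sapply(sim[, c("v1", "v2", "v3")], function(x) length(unique(x)))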

Repeating the same process with the range of v1 shrunk so that its contribution to y is comparable to that of v3, we find the model still prefers the variable with the larger number of unique values.

set.seed(123)
v1 <- runif(1000, 1, 5)      # still continuous, but over a much smaller range
v2 <- rnorm(1000, 2, 300)    # continuous noise, not used to construct y
v3 <- rpois(1000, 2.4)       # count predictor: only a handful of distinct values

y   <- v1 + 2 * v3
sim <- data.frame(y, v1, v2, v3)

simModel <- randomForest(y ~ ., data = sim,
                         importance = TRUE,
                         ntree = 1000)
varImp(simModel)
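The same preference shows up in a single CART tree, whose root split can be inspected directly (a sketch):

library(rpart)
singleTree <- rpart(y ~ ., data = sim)
singleTree$frame$var[1]   # predictor chosen for the root split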

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Both tuning parameters push the right-hand model toward a handful of dominant predictors. With a learning rate of 0.9, each tree's contribution is barely shrunk, so the first few trees, which are built on the strongest predictors, account for most of the final model. With a bagging fraction of 0.9, every tree sees nearly the same data, so the same strong predictors keep getting selected. With both parameters at 0.1, each tree is fit to a small random subsample and contributes only a small step toward the fit, so many more trees are needed and a much wider set of predictors gets a chance to be selected, spreading the importance across more predictors.
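A minimal sketch of how the two extremes in Fig. 8.24 could be reproduced on the solubility data (assuming the solTrainXtrans/solTrainY objects from AppliedPredictiveModeling; n.trees = 100 is an illustrative value, not the book's setting):

library(AppliedPredictiveModeling)
library(gbm)
data(solubility)   # provides solTrainXtrans and solTrainY

# Left-hand settings: small learning rate and bagging fraction
gbmLeft  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                    verbose = FALSE)
# Right-hand settings: both parameters at 0.9
gbmRight <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    verbose = FALSE)

head(summary(gbmLeft,  plotit = FALSE), 10)   # importance spread over more predictors
head(summary(gbmRight, plotit = FALSE), 10)   # importance concentrated on a few predictors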

(b) Which model do you think would be more predictive of other samples?

The model on the left. Its importance is spread across many predictors, so it is less likely to overfit to the handful of dominant predictors the right-hand model relies on, and it should generalize better to new samples.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth allows each tree to split on more predictors, so importance is spread across a larger set of predictors and the slope of the importance profile flattens for both models.
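This could be checked empirically by extending the sketch from part (a) (same assumed solubility objects; interaction.depth = 10 is an illustrative value):

gbmRightDeep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                        n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                        interaction.depth = 10, verbose = FALSE)
head(summary(gbmRightDeep, plotit = FALSE), 10)   # the importance profile should flatten somewhat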

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
library(RANN)

impute <- preProcess(ChemicalManufacturingProcess, "knnImpute")

chem_data <- predict(impute, ChemicalManufacturingProcess)
chem_data <- chem_data %>% select(!nearZeroVar(.))

train_index_chem <- createDataPartition(chem_data$Yield, p = .8, list = FALSE)
train_chem <- chem_data[train_index_chem, ]
test_chem  <- chem_data[-train_index_chem, ]

(a) Which tree-based regression model gives the optimal re-sampling and test set performance?

randomForChem <- randomForest(Yield ~ ., data = train_chem,
                              importance = TRUE,
                              ntree = 1000)

bagCtrl <- cforest_control(mtry = ncol(train_chem) - 1)
baggedTreeChem <- cforest(Yield ~ ., data = train_chem, controls = bagCtrl)

set.seed(100)
gbmModelChem <- train(Yield ~ ., data = train_chem, method = "gbm",
                      tuneGrid = gbmGrid,
                      verbose = FALSE)

set.seed(122)
cubistModChem <- cubist(select(train_chem, !'Yield'), train_chem$Yield)

cubistModChemAlt <- train(Yield ~ ., data = train_chem,
                          method = 'cubist')

As seen below, the caret-tuned Cubist model has the best overall test-set performance (lowest RMSE and MAE, highest R-squared).

postResample(predict(randomForChem,  test_chem), test_chem$Yield)
     RMSE  Rsquared       MAE 
0.6112199 0.5735795 0.4738552 
postResample(predict(baggedTreeChem,  newdata = test_chem), test_chem$Yield)
     RMSE  Rsquared       MAE 
0.6389091 0.5458071 0.4931080 
postResample(predict(gbmModelChem,  test_chem), test_chem$Yield)
     RMSE  Rsquared       MAE 
0.5924979 0.5978230 0.4380292 
postResample(predict(cubistModChem,  test_chem), test_chem$Yield)
     RMSE  Rsquared       MAE 
0.4931156 0.7252979 0.3967717 
postResample(predict(cubistModChemAlt ,  test_chem), test_chem$Yield)
     RMSE  Rsquared       MAE 
0.4718199 0.7488810 0.3751559 
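Resampling performance can also be compared for the two caret fits (a sketch; only the models trained through train() carry resampling results, and a strictly paired comparison would require both to be fit with identical resampling indices):

resamps <- resamples(list(GBM = gbmModelChem, Cubist = cubistModChemAlt))
summary(resamps)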

(b) Which predictors are most important in the optimal tree-based regression model? - Do either the biological or process variables dominate the list? - How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The ranking looks largely the same as for the optimal linear and nonlinear models, except that here the top three predictors carry most of the importance, whereas the other two models spread it more evenly. It is clear that the process variables dominate the list overall.

plot(varImp(cubistModChemAlt), top = 10)
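A quick tally of the top 10 makes the process/biological split explicit (a sketch reusing the varImp object):

top10 <- varImp(cubistModChemAlt)$importance %>%
  rownames_to_column('Var') %>%
  arrange(desc(Overall)) %>%
  head(10)
# Count how many of the top 10 are manufacturing-process vs. biological predictors
table(ifelse(grepl("^Manufacturing", top10$Var), "Process", "Biological"))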

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

This view further reinforces the importance of the process variables, which dominate the splits in the tree below. It also gives a clearer picture of which predictor values push the yield of a given sample up or down.

library(rpart.plot)

# Single regression tree for Yield on the training set
multi.class.model <- rpart(Yield ~ ., data = train_chem)
rpart.plot(multi.class.model)
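To see the actual distribution of yield within each terminal node, rather than just the node means shown by rpart.plot, a sketch using the tree's fitted leaf assignments:

# Yield values grouped by the terminal node each training sample falls into
nodeYield <- data.frame(node  = factor(multi.class.model$where),
                        Yield = train_chem$Yield)
boxplot(Yield ~ node, data = nodeYield,
        xlab = "terminal node", ylab = "Yield (centered and scaled)")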