Libraries

library(AppliedPredictiveModeling)
library(dplyr)
library(forecast)
library(ggplot2)
library(tidyr)
library(mice)
library(corrplot)
library(MASS)
library(earth)
library(RANN)

Exercise 8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, 
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
order(-rfImp1$Overall)
##  [1]  1  4  2  5  3  6  7 10  9  8
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

Answer: Based on rfImp1, the random forest did not make significant use of the uninformative predictors V6 – V10: their importance scores are all near zero, and several are negative. The informative predictors, in decreasing order of importance, are V1, V4, V2, V5, and V3.

(b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

Creating a copy of the original data:

simulated2 <- simulated
simulated2$duplicate1 <- simulated2$V1 + rnorm(200) * .1
cor(simulated2$duplicate1, simulated2$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated2, importance = TRUE, 
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
order(-rfImp2$Overall)
##  [1]  4  2  1 11  5  3  6 10  9  7  8
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

Answer: The importance score for V1 changed after adding the highly correlated duplicate1: it dropped from about 8.73 to 5.69, and V1 is no longer the most important predictor but the third. The importance that V1 alone carried is now split between V1 and duplicate1. Adding yet another predictor that is highly correlated with V1 should dilute V1's importance even further, as in the sketch below.
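
A minimal sketch of that follow-up (hypothetical names simulated3, duplicate2, and model2b; not run here):

simulated3 <- simulated2
simulated3$duplicate2 <- simulated3$V1 + rnorm(200) * .1  # second noisy copy of V1
model2b <- randomForest(y ~ ., data = simulated3, importance = TRUE, 
                        ntree = 1000)
varImp(model2b, scale = FALSE)  # V1's importance should now be split across V1, duplicate1, and duplicate2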

(c) Use the ‘cforest’ function in the ‘party’ package to fit a random forest model using conditional inference trees. The ‘party’ package function ‘varimp’ can calculate predictor importance. The ‘conditional’ argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
model3 <- cforest(y ~., data = simulated)
order(-varimp(model3, conditional = FALSE))
##  [1]  1  4  2  5  7  3  9  6 10  8
order(-varimp(model3, conditional = TRUE))
##  [1]  4  1  2  5  3  6  9  8 10  7

Answer: No, these importances do not show the same pattern as the traditional random forest model: the predictors are not in the same order, and the conditional measure in particular ranks V4 ahead of V1. A side-by-side view makes this easier to read (sketch below).
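
A small sketch (not run here) that lines the three orderings up in one table; note that varimp is itself randomized, so repeated calls can reorder predictors with similar scores:

data.frame(rf = order(-rfImp1$Overall),
           cforest = order(-varimp(model3, conditional = FALSE)),
           cforest_conditional = order(-varimp(model3, conditional = TRUE)))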

(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Cubist

library(Cubist)
simulated_x <- subset(simulated, select = -c(y))
cubist_model <- cubist(x = simulated_x, y = simulated$y, committees = 100)
order(-varImp(cubist_model))
##  [1]  1  3  4  2  5  6  7  8  9 10

Boosted Trees

library(gbm)
set.seed(801)
gbm_model <- gbm(y ~., data = simulated, distribution = "gaussian")
summary.gbm(gbm_model)

##     var    rel.inf
## V4   V4 29.1564185
## V1   V1 26.9686148
## V2   V2 24.2398322
## V5   V5 11.5192020
## V3   V3  7.8231482
## V6   V6  0.2927843
## V7   V7  0.0000000
## V8   V8  0.0000000
## V9   V9  0.0000000
## V10 V10  0.0000000

Answer: With the different tree models the pattern is similar but not identical. In the Cubist model V1 is the most important predictor, while in the boosted tree model V4 is the most important. In all of the models, however, the uninformative predictors V6 - V10 receive little or no importance.
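
To actually repeat the correlated-predictor experiment for these two models, a sketch (hypothetical names cubist_model2 and gbm_model2; not run here, and exact scores would vary):

cubist_model2 <- cubist(x = subset(simulated2, select = -c(y)), y = simulated2$y, 
                        committees = 100)
varImp(cubist_model2)   # does duplicate1 split V1's importance?
gbm_model2 <- gbm(y ~ ., data = simulated2, distribution = "gaussian")
summary.gbm(gbm_model2)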

Exercise 8.2

Use a simulation to show tree bias with different granularities.

set.seed(802)
low <- sample(0:10, 500, replace = TRUE)      # 11 possible values
medium <- sample(0:100, 500, replace = TRUE)  # 101 possible values
high <- sample(0:1000, 500, replace = TRUE)   # 1001 possible values

y <- low + medium + high + rnorm(500)  # noise vector must match the 500 samples

var(low)
## [1] 9.795796
var(medium)
## [1] 810.4978
var(high)
## [1] 81218.55
sim_data <- data.frame(low, medium, high, y)
diff_gran_model <- randomForest(y ~., data = sim_data, importance = TRUE, ntree = 1000)
varImp(diff_gran_model, scale=FALSE)
##            Overall
## low      -392.6412
## medium    902.1563
## high   139468.8096

Answer: Even though all three variables enter y with the same coefficient, nearly all of the importance goes to high, the variable with the largest number of distinct values. The resulting model confirms the selection bias of trees: predictors with higher granularity (more distinct values) are ranked as more important. A single tree shows the same bias directly (sketch below).
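
A minimal single-tree sketch of the same bias (not run here), loading rpart here rather than only in Exercise 8.7:

library(rpart)
single_tree <- rpart(y ~ ., data = sim_data)
single_tree$variable.importance  # the splits should concentrate on 'high', the most granular predictor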

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

Answer: With the bagging fraction and learning rate both set to 0.9 (right), each tree is fit to almost all of the training data and each tree's contribution to the ensemble is large, so the same few dominant predictors are selected again and again and importance concentrates on them. With both parameters at 0.1 (left), each tree sees only a small random subsample and takes a small step, which gives weaker predictors more opportunities to enter the model and spreads importance across more predictors.

(b) Which model do you think would be more predictive of other samples?

Answer: I think the model on the left, with the bagging fraction and shrinkage both set to 0.1, would be more predictive of other samples. Because its importance is spread across more predictors, it depends less on a handful of variables and is less prone to overfitting than the model on the right.

(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Answer: Increasing the interaction depth would increase the number of predictors considered at each stage, spreading importance across more of them. The slope of the predictor importance plot would therefore flatten for both models.
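
The experiment behind Fig. 8.24 can be sketched as follows (assumes the solubility data objects solTrainXtrans and solTrainY from AppliedPredictiveModeling; the parameter values come from the figure, the rest are guesses, and the code is not run here):

data(solubility)
gbm_left <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    bag.fraction = 0.1, shrinkage = 0.1, n.trees = 100)
gbm_right <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                     bag.fraction = 0.9, shrinkage = 0.9, n.trees = 100)
head(summary.gbm(gbm_left, plotit = FALSE), 10)   # importance spread across many predictors
head(summary.gbm(gbm_right, plotit = FALSE), 10)  # importance concentrated on a few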

Exercise 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Load data:

data(ChemicalManufacturingProcess)
set.seed(807)
imputed_data <- preProcess(ChemicalManufacturingProcess, method = c("knnImpute","nzv", "corr"))
full_data <- predict(imputed_data, ChemicalManufacturingProcess)

index_chem <- createDataPartition(full_data$Yield, p = .8, list = FALSE)
train_chem <- full_data[index_chem, ]
test_chem <- full_data[-index_chem, ]
train_predictors <- train_chem[-c(1)]  # drop the Yield column
test_predictors <- test_chem[-c(1)]
(a) Which tree-based regression model gives the optimal resampling and test set performance?

Random Forest

set.seed(817)
rf_model <- randomForest(train_predictors, train_chem$Yield, importance = TRUE, 
                         ntrees = 1000)  # note: randomForest's argument is 'ntree'; the misspelled 'ntrees' is silently ignored, so the default of 500 trees is grown (as the output below shows)
rf_model
## 
## Call:
##  randomForest(x = train_predictors, y = train_chem$Yield, importance = TRUE,      ntrees = 1000) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 0.3685206
##                     % Var explained: 62.06
rfPred <- predict(rf_model, newdata = test_predictors)
postResample(pred = rfPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6415305 0.6845213 0.4895513

Boosted Trees

set.seed(827)
gbm_model <- gbm.fit(train_predictors, train_chem$Yield, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.9705             nan     0.0010    0.0007
##      2        0.9698             nan     0.0010    0.0007
##      3        0.9690             nan     0.0010    0.0007
##      4        0.9683             nan     0.0010    0.0006
##      5        0.9676             nan     0.0010    0.0008
##      6        0.9670             nan     0.0010    0.0006
##      7        0.9663             nan     0.0010    0.0006
##      8        0.9655             nan     0.0010    0.0007
##      9        0.9647             nan     0.0010    0.0007
##     10        0.9639             nan     0.0010    0.0007
##     20        0.9569             nan     0.0010    0.0007
##     40        0.9434             nan     0.0010    0.0005
##     60        0.9298             nan     0.0010    0.0006
##     80        0.9172             nan     0.0010    0.0006
##    100        0.9051             nan     0.0010    0.0005
gbm_model
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 46 predictors of which 7 had non-zero influence.
gbmPred <- predict(gbm_model, newdata = test_predictors)
postResample(pred = gbmPred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 1.0173919 0.4896455 0.7944192
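
The fit above uses the gbm.fit defaults (100 trees, shrinkage 0.001, visible in the trace), which underfit here. A fairer comparison would tune the model through caret; a sketch with a hypothetical grid (not run here):

gbm_grid <- expand.grid(n.trees = c(100, 500, 1000),
                        interaction.depth = c(1, 3, 5),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
gbm_tuned <- train(train_predictors, train_chem$Yield, method = "gbm",
                   tuneGrid = gbm_grid, verbose = FALSE)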

Cubist

set.seed(837)
cube_model <- cubist(train_predictors, train_chem$Yield)
cube_model
## 
## Call:
## cubist.default(x = train_predictors, y = train_chem$Yield)
## 
## Number of samples: 144 
## Number of predictors: 46 
## 
## Number of committees: 1 
## Number of rules: 1
cubePred <- predict(cube_model, newdata = test_predictors)
postResample(pred = cubePred, obs = test_chem$Yield)
##      RMSE  Rsquared       MAE 
## 0.6267279 0.6609403 0.4886768

Answer: The Cubist model gives the best test set performance, with the lowest RMSE (0.627) and an R2 of 0.661, ahead of the random forest (RMSE 0.642) and the boosted trees (RMSE 1.017).
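
This is an untuned Cubist fit (one committee, one rule); tuning committees and neighbors through caret might improve it further. A sketch with a hypothetical grid (not run here):

cube_grid <- expand.grid(committees = c(1, 10, 50, 100), neighbors = c(0, 1, 5, 9))
cube_tuned <- train(train_predictors, train_chem$Yield, method = "cubist",
                    tuneGrid = cube_grid)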

(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
head(varImp(rf_model), 10)  # note: these are the first ten rows in table order, not the ten highest scores
##                           Overall
## BiologicalMaterial01    4.0327661
## BiologicalMaterial03   12.0298968
## BiologicalMaterial05    0.2910504
## BiologicalMaterial06   11.2978147
## BiologicalMaterial08    5.5829684
## BiologicalMaterial09    5.3869876
## BiologicalMaterial10    2.3758324
## BiologicalMaterial11    6.4869886
## ManufacturingProcess01  2.7009899
## ManufacturingProcess02  3.3163745

Eight of these ten predictors are ‘BiologicalMaterial’ variables and the remaining two are ‘ManufacturingProcess’ variables. Note, however, that head() returns the first ten rows of the importance table in row order rather than the ten largest scores, so ranking the predictors requires sorting on the Overall column first (see the sketch below). These top predictors differ from the top 10 of the optimal linear and nonlinear models, which contained more ‘ManufacturingProcess’ predictors.
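
A sorting sketch (hypothetical name rf_imp; not run here):

rf_imp <- varImp(rf_model)
rf_imp[order(-rf_imp$Overall), , drop = FALSE][1:10, , drop = FALSE]  # the ten largest importance scores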

(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart)
library(rpart.plot)
rpart_tree <- rpart(Yield ~., data = train_chem)
rpart.plot(rpart_tree)

Answer: This tree further reinforces the importance of the process variables. The variables appearing most frequently in the splits are ‘ManufacturingProcess’ predictors, which suggests that the manufacturing process variables are strong predictors of the Yield response. The plot also gives a clearer view of which split points drive a given sample toward higher or lower yield.
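
rpart.plot shows only the mean yield in each node; to draw the actual distribution of Yield in the terminal nodes, one option is partykit (a sketch, not run here):

library(partykit)
plot(as.party(rpart_tree))  # terminal nodes rendered as boxplots of Yield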