Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
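As an aside, for a randomForest object, caret's varImp(..., scale = FALSE) extracts the permutation importance that randomForest computes itself, so the same numbers should be available directly:
# The %IncMSE column should match varImp(model1, scale = FALSE)$Overall
head(importance(model1))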
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The uninformative predictors appear in some splits, but their permutation importance scores are near zero (several are slightly negative), far below those of V1-V5, so the model did not significantly use them.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change?
The importance score for V1 dropped (from 8.73 to 5.69), and V1 is no longer the most important predictor; V4 (7.05) and V2 (6.07) now rank above it. The new variable, duplicate1, absorbed a substantial share of the importance (4.28).
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
What happens when you add another predictor that is also highly correlated with V1?
Adding a second predictor that is also highly correlated with V1 reduced V1's importance further (from 5.69 to 4.92), and V1 still trails V4 (7.05) and V2 (6.53). The importance that V1 alone carried in the original model is now split across three correlated variables: V1 (4.92), duplicate1 (3.80), and duplicate2 (1.88). This dilution is expected, because highly correlated predictors compete for the same splits across the trees in the forest.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
The broad pattern matches the traditional random forest: V4 (5.96), V2 (5.79), and V1 (5.42) are again at the top, and the uninformative predictors (V6-V10) are near zero. The main difference is how importance is shared with the correlated copies: duplicate1 (5.28) is nearly as important as V1 itself, and duplicate2 (2.92) now edges out V5 (1.88), so the dilution across correlated predictors is even more pronounced here.
library(party)
model4 <- cforest(y ~ ., data = simulated)
# party::varimp has no 'scale' argument; conditional = FALSE (the default)
# gives the traditional permutation importance
rfImp4 <- varimp(model4, conditional = FALSE)
rfImp4
## V1 V2 V3 V4 V5 V6
## 5.42152778 5.79075433 -0.01425384 5.96294311 1.87652947 -0.06822257
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.07696942 -0.17067028 0.02155362 -0.13376134 5.27935126 2.92224516
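The conditional importance described in Strobl et al. (2007) can be obtained by toggling the conditional argument. A sketch (output not shown; the computation is considerably slower, and we would expect it to shrink the shared importance of V1 and its duplicates):
# Conditional permutation importance (Strobl et al. 2007): each predictor is
# permuted within strata of its correlated peers, so correlated copies no
# longer inflate each other's scores
rfImp4_cond <- varimp(model4, conditional = TRUE)
sort(rfImp4_cond, decreasing = TRUE)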
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
For the boosted trees, the top five predictors (V4, V2, V1, duplicate1, V5) match the traditional random forest, in the same order. The numbers look much larger only because gbm reports relative influence, scaled to sum to 100.
For the Cubist model, five predictors share the same importance value (50): V1, V2, V4, V5, and duplicate1. These are the same top five as the other models, but V3 and duplicate2, the second highly correlated variable we introduced, receive no importance at all.
# boosted
library(gbm)
gbmModel <- gbm(y ~ .,
                data = simulated,
                distribution = "gaussian")
summary(gbmModel)
## var rel.inf
## V4 V4 30.8498158
## V2 V2 20.9639117
## V1 V1 16.1293953
## duplicate1 duplicate1 11.5441703
## V5 V5 11.0492470
## V3 V3 8.5487055
## duplicate2 duplicate2 0.7137245
## V6 V6 0.2010299
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
library(Cubist)
cubistModel <- cubist(simulated[, -11], simulated$y)  # column 11 is y
varImp(cubistModel)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate2 0
Use a simulation to show tree bias with different granularities.
The text (p182) states: “…these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors”.
Creating a simulated dataset with 2 predictors:
Create a single tree and look at the importance. Though both x1 and x2 contribute to y, x1, which takes values over a much larger range, is given far higher importance.
# create the data
set.seed(888)
x1 <- runif(500, 0, 1000)  # range of 0 - 1000
x2 <- runif(500, 0, 10)    # range of 0 - 10
# make y based on both x1 & x2 plus some noise
y <- 0.5 * x1 - 2 * x2 + rnorm(500, sd = 2)
sim <- data.frame(x1, x2, y)
head(sim)
## x1 x2 y
## 1 25.50811 6.606107 -2.985639
## 2 346.68075 1.701234 168.820088
## 3 61.24983 8.801614 11.531905
## 4 683.75206 2.359146 336.659749
## 5 767.25377 3.315835 375.861369
## 6 104.10757 8.385882 36.251889
library(rpart)
Tree <- rpart(y ~ ., data = sim)
varImp(Tree, scale = FALSE)
## Overall
## x1 5.2760359
## x2 0.2954787
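As a complementary, hypothetical sketch that isolates the selection bias: neither predictor below is related to y, yet over repeated fits rpart's first split favors the predictor with more distinct values:
# Pure selection bias: x1 has ~100 distinct values, x2 only 4, and neither
# predicts y. Force a single split (cp = -1) and tally the chosen variable.
set.seed(888)
picks <- replicate(200, {
  d <- data.frame(x1 = runif(100),                       # many distinct values
                  x2 = sample(1:4, 100, replace = TRUE), # few distinct values
                  y  = rnorm(100))                       # no signal at all
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  labels(fit)[2]  # label of the first split, e.g. "x1< 0.52"
})
table(substr(picks, 1, 2))  # expect far more splits on x1 than on x2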
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
From the text (pg 206):
* “… constrain the learning process by employing regularization, or
shrinkage…Instead of adding the predicted value for a sample to previous
iteration’s predicted value, only a fraction of the current predicted
value is added to the previous iteration’s predicted value. This
fraction is commonly referred to as the learning
rate and is parameterized by the symbol, λ. This parameter
can take values between 0 and 1 and becomes another tuning parameter for
the model.”
* “…randomly select a fraction of the training data. The residuals and
models in the remaining steps of the current iteration are based only on
the sample of data. The fraction of training data used, known as the
bagging fraction, then becomes another tuning
parameter for the model…”
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Since the right-hand model uses a learning rate and bagging fraction of 0.9, each iteration adds 90% of the current tree's predicted value to the previous iteration's prediction, and each tree is fit on 90% of the training data. With so little shrinkage and so little sampling variation between iterations, each step reinforces the previous one, and importance concentrates on a small number of predictors. The left-hand model (both parameters 0.1) takes smaller steps on more varied subsamples, giving more predictors a chance to contribute.
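As a rough illustration (using the simulated Friedman data from above rather than the book's solubility data), fitting gbm at the two extremes shows the concentration effect:
# Illustrative only: compare relative influence at the two parameter extremes
gbmLow  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)
gbmHigh <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow, plotit = FALSE)   # influence spread over more predictors
summary(gbmHigh, plotit = FALSE)  # influence concentrated in the top few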
Which model do you think would be more predictive of other samples?
The model on the left, with the learning rate and bagging fraction of 0.1, spreads importance across a larger number of predictors and is therefore more likely to be predictive of other samples; the right-hand model risks overfitting to its few dominant predictors.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Per the text (pg 206-207), the larger value of the shrinkage (learning rate) parameter, here 0.9, has an impact on reducing RMSE for all choices of tree depth and number of trees.
Increasing the interaction (tree) depth lets each tree use more predictors, which would spread importance across more of them and flatten the slope of the importance curve for both models in Fig. 8.24; the effect would likely be larger on the left, where the learning rate is 0.1. A rough check appears below.
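A hypothetical check on the simulated data: refit with a larger interaction.depth and compare the influence profile to the depth-1 gbmLow fit above:
# Deeper trees can use more predictors per tree, flattening the profile
gbmDeep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1,
               interaction.depth = 7)
summary(gbmDeep, plotit = FALSE)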
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Replicating the preprocessing used in 6.3 and 7.5:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
cmp <- preProcess(ChemicalManufacturingProcess,
                  method = "knnImpute")
cmp <- predict(cmp, ChemicalManufacturingProcess)
# split
set.seed(888)
train_rw2 <- createDataPartition(cmp$Yield,
                                 p = 0.8,
                                 list = FALSE)
train_cmp <- cmp[train_rw2, -1]
train_yld <- cmp$Yield[train_rw2]
test_cmp <- cmp[-train_rw2, -1]
test_yld <- cmp$Yield[-train_rw2]
# drop near-zero-variance predictors, identified on the training set only
train_cmp2 <- train_cmp[, -nearZeroVar(train_cmp)]
test_cmp2 <- test_cmp[, -nearZeroVar(train_cmp)]
Which tree-based regression model gives the optimal resampling and test set performance?
Tried Random Forest, Boosted Trees (GBM) and Cubist, with the following test-set results:
Random Forest
* The \(R^2\) tells us that about 64% of the variation in the test data can be predicted using this model
* RMSE tells us that the square root of the average squared difference between prediction and actual observation on the test data is 0.544
GBM
* The \(R^2\) tells us that about 66% of the variation in the test data can be predicted using this model
* RMSE on the test data is 0.523
Cubist
* The \(R^2\) tells us that about 68% of the variation in the test data can be predicted using this model
* RMSE on the test data is 0.528
Cubist has the largest \(R^2\) (0.676) and the smallest MAE (0.393), while its RMSE (0.528) is essentially tied with GBM's (0.523). On balance, Cubist is taken as the optimal model.
rf <- train(train_cmp2, train_yld,
            method = 'rf',
            importance = TRUE)
rfPred <- predict(rf, newdata = test_cmp2)
postResample(pred = rfPred, obs = test_yld)
## RMSE Rsquared MAE
## 0.5443417 0.6355169 0.4227794
gbmGrid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                       .n.trees = seq(100, 1000, by = 50),
                       .shrinkage = c(0.01, 0.1),
                       .n.minobsinnode = 10)
gbm <- train(train_cmp2, train_yld,
             method = 'gbm',
             tuneGrid = gbmGrid,
             verbose = FALSE)
gbmPred <- predict(gbm, newdata = test_cmp2)
postResample(pred = gbmPred, obs = test_yld)
## RMSE Rsquared MAE
## 0.5228054 0.6637246 0.4052370
cube <- train(train_cmp2, train_yld,
              method = 'cubist',
              verbose = FALSE)
cubePred <- predict(cube, newdata = test_cmp2)
postResample(pred = cubePred, obs = test_yld)
## RMSE Rsquared MAE
## 0.5277177 0.6755219 0.3931756
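Resampling performance can also be compared directly across the three fits. A sketch (ideally the three train calls would share resampling indices via a common trainControl, or at least a set.seed before each call):
# Collect resampling results; each model used caret's default of 25
# bootstrap resamples
res <- resamples(list(RF = rf, GBM = gbm, Cubist = cube))
summary(res)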
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
See the list of predictors below. The top of the list is dominated by the manufacturing variables: in the top 10, only BiologicalMaterial12 and BiologicalMaterial02 (the latter tied with ManufacturingProcess04) are biological, so Manufacturing 'dominates'.
Comparing these results to those from the optimal linear and nonlinear models:
varImp(cube)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 75.000
## ManufacturingProcess09 55.682
## ManufacturingProcess01 35.227
## BiologicalMaterial12 27.273
## ManufacturingProcess13 23.864
## ManufacturingProcess29 23.864
## ManufacturingProcess33 20.455
## ManufacturingProcess25 18.182
## BiologicalMaterial02 15.909
## ManufacturingProcess04 15.909
## BiologicalMaterial03 10.227
## ManufacturingProcess26 6.818
## BiologicalMaterial06 6.818
## ManufacturingProcess31 6.818
## ManufacturingProcess14 5.682
## ManufacturingProcess45 5.682
## ManufacturingProcess27 5.682
## ManufacturingProcess28 5.682
## BiologicalMaterial04 4.545
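To make the top-10 comparison concrete: if the optimal linear and nonlinear fits from Exercises 6.3 and 7.5 were still in the workspace, the overlap could be computed directly. The object names plsFit and svmFit below are assumptions, not the names actually used earlier:
# Hypothetical comparison; 'plsFit' and 'svmFit' are assumed object names
# for the optimal linear (Ex. 6.3) and nonlinear (Ex. 7.5) train fits
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)
}
intersect(top10(cube), top10(plsFit))  # shared with the linear model
intersect(top10(cube), top10(svmFit))  # shared with the nonlinear model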
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
From this tree plot, we can see:
* 17% of the samples (the 8th leaf at 7% plus the 9th at 10%) can be predicted without any BiologicalMaterial (BM) predictor factored in. These leaves are reached through ManufacturingProcess (MP) 32, then MP31, then MP04.
* 58% of the samples (the right side of the tree, leaves 6-12) involve only one BM predictor (BM12 for leaves 6 & 7, BM03 for leaves 9-12).
* Only the leftmost three leaves (21% in total) use more BM than MP predictors (MP32, BM12, BM05 and BM09).
This clearly shows the dominance of the MP predictors.
# since we are going to plot this, rename the relevant fields
library(dplyr)
train_cmp2 <- train_cmp2 |>
  rename(MP32 = ManufacturingProcess32,
         MP31 = ManufacturingProcess31,
         MP17 = ManufacturingProcess17,
         MP04 = ManufacturingProcess04,
         BM12 = BiologicalMaterial12,
         BM05 = BiologicalMaterial05,
         BM03 = BiologicalMaterial03,
         BM09 = BiologicalMaterial09,
         BM10 = BiologicalMaterial10)
library(rpart.plot)
set.seed(888)
tree2 <- train(train_cmp2, train_yld,
               method = "rpart2",
               tuneLength = 10,
               trControl = trainControl(method = "cv"))
rpart.plot(tree2$finalModel)