8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.1.2
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.2
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
model1 <- randomForest(y ~ ., data = simulated,
mportance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
rfImp1
## Overall
## V1 1101.5125
## V2 918.2002
## V3 292.9121
## V4 1050.8320
## V5 489.9815
## V6 177.9841
## V7 199.3523
## V8 146.3390
## V9 156.1152
## V10 178.7002
The random forest model significantly did not use the uninformative predictors. The predictors V6 ~ V10 have very small importance, compare to other predictors such as V1, V2, V4, and V5. (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9312879
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
m2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(m2, scale = FALSE)
rfImp2
## Overall
## V1 5.62411205
## V2 6.18242103
## V3 0.66445442
## V4 6.69869172
## V5 2.01545359
## V6 0.16952216
## V7 -0.03423950
## V8 -0.11680402
## V9 -0.04139870
## V10 0.01472598
## duplicate1 3.91374379
Yes, it did. The importance for V1 is reduced, splitted between V1 and duplicated1.It happened because the trees in the random forest are just as likely to pick V1 for a split than to pick duplicated1, because there is a correlation. As a result, the total number of splits that originally belonged to V1 is “shared” with duplicated1.
library(data.table)
## Warning: package 'data.table' was built under R version 4.1.2
library(party)
## Warning: package 'party' was built under R version 4.1.3
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.1.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.1.3
frstmodel <- cforest(y ~ ., data=simulated)
sort(varimp(frstmodel, conditional = FALSE), decreasing = TRUE)
## V4 V2 V1 duplicate1 V5 V7
## 7.469697868 6.286805180 6.102506073 3.339546539 1.697256639 0.016261856
## V3 V6 V9 V10 V8
## 0.013701626 0.002575282 -0.008888066 -0.014897148 -0.045566285
sort(varimp(frstmodel, conditional = TRUE), decreasing = TRUE)
## V4 V2 V1 duplicate1 V5 V3
## 5.910739023 5.026287021 2.863024722 1.110339875 1.019342990 0.016134445
## V6 V9 V7 V10 V8
## 0.011933896 0.004576673 0.003943168 -0.005913247 -0.008058880
After calculating conditionally the variable importance it shows that it takes into account correlation between variable V1 and duplicate1. It adjust importance score of these two variables accordingly. While the variable importance is calculated un-conditionally then it treats both highly correlated variables (V1 and duplicate1) with equal importance which can be mispresented.
library(gbm)
## Warning: package 'gbm' was built under R version 4.1.3
## Loaded gbm 2.1.8
gbmModel <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
gbmModel
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 11 predictors of which 7 had non-zero influence.
summary(gbmModel)
## var rel.inf
## V4 V4 29.6311732
## V2 V2 23.9753767
## V1 V1 23.3175633
## V5 V5 10.1218535
## V3 V3 8.4820347
## duplicate1 duplicate1 4.3161926
## V6 V6 0.1558059
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
For Boosted Trees,the same patten occurs where V4 is still the highest and V6 thru V10 are still low.
library(Cubist)
## Warning: package 'Cubist' was built under R version 4.1.3
cmdl <- cubist(simulated[,-11], simulated$y)
cmdl
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
## Number of samples: 200
## Number of predictors: 11
##
## Number of committees: 1
## Number of rules: 1
summary(cmdl)
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Sun May 01 22:44:27 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (12 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.181831
## Relative |error| 0.54
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
##
##
## Time: 0.0 secs
It can be assumed that Cubist model is the only one which uses highly important variables.
V1, V2, V4 and V5 are the most important features that were used by Cubist model.
8.2. Use a simulation to show tree bias with different granularities.
set.seed(999)
x<- sample(0:10000 / 10000, 200, replace = T)
y<- sample(0:1000 / 1000, 200, replace = T)
w <- sample(0:100 / 100, 200, replace = T)
z <- sample(0:10 / 10, 200, replace = T)
p <- x + z + rnorm(200)
df <- data.frame(x, y, w, z, p)
str(df)
## 'data.frame': 200 obs. of 5 variables:
## $ x: num 0.911 0.544 0.62 0.673 0.241 ...
## $ y: num 0.569 0.398 0.99 0.82 0.692 0.635 0.556 0.839 0.282 0.763 ...
## $ w: num 0.99 0.23 0.31 0.79 0.13 0.96 0.68 0.47 0.68 0.22 ...
## $ z: num 0.6 0.7 0.4 0.7 0.7 0.6 0.8 0.9 0.5 0.7 ...
## $ p: num 2.09 1.16 1.6 1.36 2.09 ...
#using rpart for fitting
library(rpart)
tmodel <- rpart(y ~ ., data=df)
varImp(tmodel)
## Overall
## p 0.5814526
## w 0.8588503
## x 0.5198500
## z 0.4592129
The tree uses x mostly to split and it uses z the least often. On the other hand it also uses y and w to split, while they are not used to produce the model. Also, y and w have more different values compared to z. That shows that there ia a selection bias in the model.
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudesof these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors? Increasing a learning rate is intended to slow down the adaptation of the model to the training data, lower adaptation results in fewer predictors. The model on the right has a higher learning rate and this is closer to optimal. There are more variables with lower weight because this model effectively predicts more about the data. It will take longer to generate such a model.Additionally, the model on the right was also trained with more data which suppose to reduce the importance of marginal variables. On the other hand, the model on the left is less optimized, and with a smaller data set. It does not have enough information to reduce variable importance in the same way as the model on the right.
Which model do you think would be more predictive of other samples? It would be expected that the performance of the model on the right to be better on analogous data sets due to the level of tuning. But it could likely also generate more sensitive differences in the predicted data. However, since this involves a trade off between bias-variance, I believe that using cross validation or a test set.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? The approach which should be taken to increase interaction depth affect the slope of predictor importance for either model, which would means that increasing interaction will provide more dense trees, what would mean more predictors will be taken into account, so varimp function will show allocation of importance across more variables. Therefore, increasing the depth reduce the slope of the importance plot
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.1.3
data(ChemicalManufacturingProcess)
(cmpImpute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute')))
## Created from 152 samples and 57 variables
##
## Pre-processing:
## - bagged tree imputation (57)
## - ignored (0)
cmp <- predict(cmpImpute, ChemicalManufacturingProcess[,-c(1)])
set.seed(999)
trainRow <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
trainx <- cmp[trainRow, ]
trainy <- ChemicalManufacturingProcess$Yield[trainRow]
testx <- cmp[-trainRow, ]
testy <- ChemicalManufacturingProcess$Yield[-trainRow]
rpmodel <- train(x = trainx,
y = trainy,
method = "rpart",
tuneLength = 10,
control = rpart.control(maxdepth=2))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
rpmodel
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.00771655 1.545751 0.3337652 1.230786
## 0.01452393 1.545751 0.3337652 1.230786
## 0.02597846 1.547909 0.3321186 1.233044
## 0.03316588 1.548980 0.3304596 1.232825
## 0.03583795 1.547007 0.3313855 1.230445
## 0.03633812 1.547007 0.3313855 1.230445
## 0.05190902 1.555853 0.3243507 1.240100
## 0.06399225 1.549841 0.3256107 1.234475
## 0.08270155 1.556993 0.3164081 1.235518
## 0.40695439 1.628733 0.3026192 1.308738
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01452393.
rmodel <- train(x = trainx,
y = trainy,
method = 'rf',
tuneLength = 10)
rmodel
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.361659 0.4927028 1.0777609
## 8 1.274503 0.5375705 1.0033760
## 14 1.262533 0.5388912 0.9916761
## 20 1.258331 0.5380994 0.9897108
## 26 1.256970 0.5363426 0.9853907
## 32 1.257390 0.5343093 0.9834924
## 38 1.261075 0.5301549 0.9857780
## 44 1.265822 0.5254258 0.9892238
## 50 1.274982 0.5194139 0.9952460
## 57 1.278971 0.5156588 0.9989663
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 26.
resamp <- resamples(list(SingleTree=rpmodel, RandomForest=rmodel))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: SingleTree, RandomForest
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.0589102 1.1703070 1.2109348 1.2307862 1.264816 1.506939 0
## RandomForest 0.7779966 0.8766132 0.9768654 0.9853907 1.098811 1.241988 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.240733 1.457838 1.529595 1.545751 1.637663 1.871156 0
## RandomForest 1.009419 1.139510 1.226476 1.256970 1.405009 1.548129 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 0.04207691 0.2758174 0.3378588 0.3337652 0.4020415 0.4716263 0
## RandomForest 0.37073833 0.4672912 0.5365569 0.5363426 0.5926409 0.7533917 0
Random’s forecast mean R^2 is higher than single tree.
varImp(rpmodel)
## rpart variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess13 100.00
## BiologicalMaterial03 69.25
## BiologicalMaterial06 67.96
## ManufacturingProcess32 63.24
## ManufacturingProcess36 39.39
## ManufacturingProcess06 35.81
## ManufacturingProcess17 34.45
## ManufacturingProcess11 30.80
## BiologicalMaterial11 30.37
## BiologicalMaterial12 30.15
## BiologicalMaterial02 28.26
## ManufacturingProcess30 0.00
## BiologicalMaterial10 0.00
## ManufacturingProcess27 0.00
## ManufacturingProcess26 0.00
## ManufacturingProcess16 0.00
## ManufacturingProcess18 0.00
## ManufacturingProcess08 0.00
## ManufacturingProcess33 0.00
## ManufacturingProcess12 0.00
varImp(rmodel)
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## BiologicalMaterial03 24.652
## ManufacturingProcess13 22.916
## BiologicalMaterial06 18.003
## ManufacturingProcess36 14.298
## BiologicalMaterial12 12.785
## ManufacturingProcess09 12.177
## ManufacturingProcess17 12.146
## ManufacturingProcess31 11.506
## ManufacturingProcess28 9.131
## BiologicalMaterial11 8.747
## ManufacturingProcess06 8.242
## BiologicalMaterial02 7.908
## BiologicalMaterial04 7.104
## BiologicalMaterial05 6.723
## ManufacturingProcess20 6.010
## ManufacturingProcess33 5.430
## BiologicalMaterial09 5.140
## ManufacturingProcess11 4.883
## ManufacturingProcess24 4.534
ManuFacturing materials are highest ranking but in top 10 they are equal.
plot(rpmodel$finalModel)
text(rpmodel$finalModel)
Yes, it does show that that ManuFacturing materials are in higher importance compare to biological. Higher value of ManufacturingProcess32 leads to higher yield, and vice versa. The assign values are the means of the groups splitted by ManufacturingProcess32 at 159.5, as demonstrated below: