Recreate the simulated data from Exercise 7.2:
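The code below assumes the following packages are attached; this is a minimal setup sketch, and the exact list may differ from the original session.
library(mlbench)                    # mlbench.friedman1() simulated data
library(randomForest)               # randomForest()
library(caret)                      # varImp(), train(), createDataPartition()
library(party)                      # cforest(), varimp()
library(gbm)                        # gbm()
library(Cubist)                     # cubist()
library(rpart)                      # rpart()
library(partykit)                   # as.party(); note partykit also exports
                                    # cforest()/varimp(), so load order matters
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(dplyr)                      # %>% pipe used with sort()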
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
head(simulated)
## V1 V2 V3 V4 V5 V6 V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## V8 V9 V10 y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817
str(simulated)
## 'data.frame': 200 obs. of 11 variables:
## $ V1 : num 0.534 0.584 0.59 0.691 0.667 ...
## $ V2 : num 0.648 0.438 0.588 0.226 0.819 ...
## $ V3 : num 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## $ V4 : num 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## $ V5 : num 0.929 0.1638 0.8941 0.6374 0.0831 ...
## $ V6 : num 0.3618 0.4531 0.0268 0.525 0.2234 ...
## $ V7 : num 0.827 0.649 0.179 0.513 0.664 ...
## $ V8 : num 0.421 0.845 0.35 0.797 0.904 ...
## $ V9 : num 0.5911 0.9282 0.0176 0.6899 0.397 ...
## $ V10: num 0.589 0.758 0.444 0.445 0.55 ...
## $ y : num 18.5 16.1 17.8 13.8 18.4 ...
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.83890885
## V2 6.49023056
## V3 0.67583163
## V4 7.58822553
## V5 2.27426009
## V6 0.17436781
## V7 0.15136583
## V8 -0.03078937
## V9 -0.02989832
## V10 -0.08529218
Did the random forest model significantly use the uninformative predictors (V6 – V10)? No. From the table above, the importance scores for V6 – V10 are close to zero (and in some cases slightly negative), which is very small compared to the scores for V1 – V5, so the random forest did not make meaningful use of the uninformative predictors.
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 6.29780744
## V2 6.08038134
## V3 0.58410718
## V4 6.93924427
## V5 2.03104094
## V6 0.07947642
## V7 -0.02566414
## V8 -0.11007435
## V9 -0.08839463
## V10 -0.00715093
## duplicate1 3.56411581
From the output above, the importance of V1 decreased (from about 8.84 to about 6.30) after adding duplicate1. Because V1 and duplicate1 are highly correlated, the trees in the random forest are roughly as likely to pick duplicate1 for a split as V1, so the split credit that previously belonged entirely to V1 is now shared with duplicate1.
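To answer the second part of the question, the sketch below (my own addition, results not shown) adds a second predictor, duplicate2, that is also highly correlated with V1 and refits the forest on a copy of the data; the expectation is that V1's importance is diluted further because the split credit is now shared three ways.
# Hypothetical follow-up: a second correlated copy of V1, fit on a copy of the data
sim_dup2 <- simulated
sim_dup2$duplicate2 <- sim_dup2$V1 + rnorm(200) * .1
cor(sim_dup2$duplicate2, sim_dup2$V1)
model3 <- randomForest(y ~ ., data = sim_dup2, importance = TRUE, ntree = 1000)
varImp(model3, scale = FALSE)   # V1 should drop further, sharing credit with both duplicates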
cforest_model <- cforest(y~., data= simulated)
# Unconditional importance measure
varimp(cforest_model) %>% sort(decreasing = T)
## V4 V1 V2 duplicate1 V5 V3
## 7.05287555 6.81916581 5.76581501 4.61092783 2.23557425 0.24301265
## V7 V6 V9 V8 V10
## 0.23570674 0.05463455 -0.07485005 -0.16012095 -0.23819254
# Conditional importance measure
varimp(cforest_model, conditional=T) %>% sort(decreasing = T)
## V4 V2 V1 duplicate1 V5 V3
## 5.90953963 4.88997499 3.97578757 1.80565856 1.54768911 0.04714052
## V7 V9 V6 V8 V10
## -0.08006441 -0.18315215 -0.19461785 -0.26034939 -0.35914281
From the tables above, V4 is rated most important under both measures, and the uninformative predictors (V6 – V10) still receive low importance. The conditional importance scores differ somewhat from the traditional (unconditional) ones: the importances of the correlated predictors are reduced, especially for V1 and duplicate1, because the conditional permutation scheme accounts for the correlation between them.
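Permutation importances from a single cforest fit can be noisy. The sketch below (illustrative settings of my own choosing, not part of the original analysis) grows more trees via cforest_unbiased() before recomputing the conditional importance, which should stabilize the ranking.
# Illustrative only: more trees for a more stable conditional importance
set.seed(200)
cforest_model2 <- cforest(y ~ ., data = simulated,
                          controls = cforest_unbiased(ntree = 1000))
varimp(cforest_model2, conditional = TRUE) %>% sort(decreasing = TRUE)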
I will start with the gradient boosted trees model.
gbmmodel <- gbm(y~., data = simulated, distribution = "gaussian")
summary(gbmmodel)
## var rel.inf
## V4 V4 29.1007123
## V1 V1 25.4265069
## V2 V2 24.4644959
## V5 V5 10.5766499
## V3 V3 7.2246950
## duplicate1 duplicate1 2.5932312
## V6 V6 0.3390871
## V9 V9 0.2746217
## V7 V7 0.0000000
## V8 V8 0.0000000
## V10 V10 0.0000000
From the table above, the same overall pattern occurs: V6 – V10 are still very low, and V4, V1, and V2 are at the top. Unlike the random forest, however, boosting assigns duplicate1 only a small share of the importance (about 2.6 versus 25.4 for V1). Because boosted trees are grown sequentially and depend on one another, once V1 has been used the correlated copy adds little, so the relative influence is concentrated on the top predictors and looks more "stretched" than in the other tree models.
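The gbm() call above relies on the package defaults (100 trees and default shrinkage). As a quick stability check, the sketch below (illustrative, untuned settings of my own choosing) refits the booster with more trees and an explicit interaction depth; the relative-influence ranking can then be compared with the table above.
# Illustrative, untuned settings to check whether the ranking is stable
set.seed(200)
gbmmodel2 <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.05)
summary(gbmmodel2, plotit = FALSE)   # relative influence table only, no plot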
Now looking at the Cubist model
cubistmodel <- cubist(x=simulated[, -(ncol(simulated)-1)], y=simulated$y, committees=100)
varImp(cubistmodel)
## Overall
## V1 64.5
## V3 41.0
## V2 60.0
## V4 48.0
## V5 31.0
## V6 9.0
## duplicate1 6.0
## V8 2.0
## V10 0.5
## V7 0.0
## V9 0.0
The Cubist result is a little different: the uninformative predictors V6 – V10 are still rated very low, but the top predictor is now V1 rather than V4, and the duplicated predictor duplicate1 is rated well below V1.
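For completeness, one can also look at the raw usage statistics that caret's varImp() is built on: the fitted Cubist object stores, for each predictor, how often it appears in rule conditions versus in the per-rule linear models. The sketch below assumes the object exposes a usage table with Conditions, Model, and Variable columns (true for the Cubist versions I have used, but treat it as an assumption).
# Raw usage statistics behind the importance scores (column names assumed)
head(cubistmodel$usage)
# Hand-rolled equivalent of the default varImp() weighting, if the columns match
with(cubistmodel$usage, setNames((Conditions + Model) / 2, Variable))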
Use a simulation to show tree bias with different granularities.
Single regression trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (those with fewer distinct values). When the data contain a mix of informative and noise variables, a noise variable with many distinct values can be chosen for splits more often than an informative variable with few distinct values, and the resulting trees can be misleading.
To demonstrate this, a predictor v1 is created with 100 values equal to 1 and 100 values equal to 2. The response y is set equal to v1 plus pseudo-random Gaussian noise with μ = 0, σ = 4. A second predictor v2, unrelated to the response but with many distinct values, is generated as pseudo-random Gaussian noise with μ = 0, σ = 4.
I will use the rpart() function, which makes splits based on the CART methodology, and then the varImp() function to calculate the variable importance for the fitted tree.
set.seed(123)
v1 <- rep(1:2, each= 100)
y<- v1+ rnorm(200, mean = 0, sd = 4)
v2 <- rnorm(200, mean = 0, sd=4)
mydata <- data.frame(y=y, v1=v1, v2=v2)
fit <- rpart(y~ ., data = mydata)
varImp(fit)
## Overall
## v1 0.01511841
## v2 0.34894660
From the output above, v1, the informative predictor with only two distinct values, receives far less importance than v2, the uninformative predictor with many distinct values; the importance of v2 is much higher, illustrating the selection bias toward finer-grained predictors.
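To make the granularity effect more explicit, the sketch below (my own extension of the exercise) repeats the simulation with uninformative predictors of increasing numbers of distinct values and tallies which variable wins the root split across 100 replications; the finer-grained noise variables should win more often than the informative two-level predictor.
# Repeat the simulation and record which predictor is chosen at the root split
set.seed(123)
n_sims <- 100
first_split <- character(n_sims)
for (i in seq_len(n_sims)) {
  v1         <- rep(1:2, each = 100)               # informative, 2 distinct values
  y          <- v1 + rnorm(200, mean = 0, sd = 4)
  noise10    <- sample(1:10,  200, replace = TRUE) # uninformative, 10 distinct values
  noise100   <- sample(1:100, 200, replace = TRUE) # uninformative, 100 distinct values
  noise_cont <- rnorm(200)                         # uninformative, continuous
  dat <- data.frame(y, v1, noise10, noise100, noise_cont)
  fit <- rpart(y ~ ., data = dat)
  first_split[i] <- if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA
}
table(first_split, useNA = "ifany")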
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has a higher bagging fraction and learning rate. With a larger bagging fraction, a larger portion of the training data is used to construct each tree; as the fraction approaches 1, the samples drawn for each tree become very similar to one another, so the same few predictors dominate tree after tree and the importance concentrates on them. A higher learning rate also makes the fitting greedier, which reinforces that concentration.
The model on the left has a lower bagging fraction and learning rate. The lower the learning rate, the less greedy the model is, which makes it more likely to identify a larger number of predictors as important.
The model on the left, with the lower bagging fraction and learning rate, is likely to be more predictive of other samples because it is less prone to overfitting a few dominant predictors.
When we increase the tree depth (interaction depth), the variable importance is more likely to be spread out over more predictors, because more predictors are considered for the additional splits in each tree.
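Since the solubility data from the book are not loaded here, the sketch below illustrates the same contrast on the simulated Friedman data used earlier: one gbm fit with both the bagging fraction and learning rate at 0.1 and one with both at 0.9 (the two extreme settings from the exercise; everything else is my own illustrative choice). Comparing the two relative-influence tables should show the 0.9/0.9 fit concentrating importance on fewer predictors.
# Illustrative contrast of the two extreme settings on the simulated data
set.seed(100)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbm_low,  plotit = FALSE)   # importance spread over more predictors
summary(gbm_high, plotit = FALSE)   # importance concentrated in a few predictors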
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
set.seed(123)
cmp_pred <- as.matrix(ChemicalManufacturingProcess[,2:58])
cmp_yield <- ChemicalManufacturingProcess[,1]
train <- createDataPartition(cmp_yield, p=0.75, list = F) #create train set
train_x <- ChemicalManufacturingProcess[train, -1]
train_y <- ChemicalManufacturingProcess[train, 1]
test_x <- ChemicalManufacturingProcess[-train, -1]
test_y <- ChemicalManufacturingProcess[-train, 1]
pre_process <- c("nzv", "corr", "center", "scale", "medianImpute")
ctrl <- trainControl(method = "boot", number = 25)
Based on the resampled RMSE results below, the Cubist model gives the lowest value.
Recursive Partitioning.
set.seed(123)
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(x = train_x, y = train_y, method = "rpart2",metric = "Rsquared", tuneGrid = rpartGrid, trControl = ctrl, preProcess=pre_process)
rp_model
## CART
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.587109 0.3222359 1.262216
## 2 1.588304 0.3471138 1.260578
## 3 1.592221 0.3548035 1.257264
## 4 1.577274 0.3744462 1.237809
## 5 1.596035 0.3695650 1.259102
## 6 1.590127 0.3799986 1.248773
## 7 1.594061 0.3795646 1.250253
## 8 1.588828 0.3833099 1.239152
## 9 1.591847 0.3840905 1.239384
## 10 1.593658 0.3852311 1.240069
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 10.
Recursive Partitioning: RMSE 1.496662
Random Forest
set.seed(123)
rfGrid <- expand.grid(mtry=seq(2,38,by=3))
rf_model <- train(x = train_x, y = train_y, method = "rf", tuneGrid = rfGrid, metric = "Rsquared", importance = TRUE, trControl = ctrl,preProcess=pre_process)
rf_model
## Random Forest
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.322484 0.6228670 1.0575703
## 5 1.258284 0.6315350 1.0022734
## 8 1.233013 0.6361845 0.9780715
## 11 1.222241 0.6352190 0.9674696
## 14 1.223318 0.6284183 0.9651774
## 17 1.221142 0.6257127 0.9622421
## 20 1.219153 0.6220413 0.9578579
## 23 1.218108 0.6195639 0.9558484
## 26 1.220456 0.6150509 0.9569360
## 29 1.222664 0.6112294 0.9566539
## 32 1.226227 0.6062553 0.9562464
## 35 1.227096 0.6038911 0.9586372
## 38 1.231679 0.5993550 0.9592835
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
Generalized Boosted Regression
set.seed(300)
gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1),
n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2),
n.minobsinnode=5)
gb_model <- train(x = train_x, y = train_y,method = "gbm", metric = "Rsquared",verbose = FALSE, tuneGrid = gbmGrid, trControl = ctrl, preProcess=pre_process)
gb_model
## Stochastic Gradient Boosting
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 25 1.750868 0.4436225 1.4232258
## 0.01 1 50 1.655577 0.4723749 1.3428954
## 0.01 1 100 1.526614 0.5115909 1.2361620
## 0.01 1 200 1.387282 0.5448973 1.1180862
## 0.01 2 25 1.720034 0.4909505 1.3971250
## 0.01 2 50 1.604482 0.5129094 1.3008413
## 0.01 2 100 1.457907 0.5388108 1.1768789
## 0.01 2 200 1.320333 0.5676001 1.0537606
## 0.01 3 25 1.709386 0.5048068 1.3898578
## 0.01 3 50 1.586327 0.5187980 1.2868118
## 0.01 3 100 1.431423 0.5429917 1.1560628
## 0.01 3 200 1.296479 0.5721581 1.0305485
## 0.01 4 25 1.698272 0.5251595 1.3803802
## 0.01 4 50 1.567858 0.5429447 1.2741709
## 0.01 4 100 1.405934 0.5618872 1.1358119
## 0.01 4 200 1.280113 0.5807098 1.0165825
## 0.01 5 25 1.697737 0.5119898 1.3801486
## 0.01 5 50 1.566274 0.5345568 1.2723130
## 0.01 5 100 1.402910 0.5593929 1.1322326
## 0.01 5 200 1.273215 0.5830422 1.0078441
## 0.01 6 25 1.691419 0.5274626 1.3740537
## 0.01 6 50 1.558710 0.5410137 1.2656960
## 0.01 6 100 1.395360 0.5600297 1.1245079
## 0.01 6 200 1.268217 0.5822350 1.0007987
## 0.05 1 25 1.481059 0.5057445 1.2009515
## 0.05 1 50 1.350340 0.5444165 1.0838001
## 0.05 1 100 1.261149 0.5746823 0.9951545
## 0.05 1 200 1.230040 0.5852977 0.9564855
## 0.05 2 25 1.411503 0.5359489 1.1366300
## 0.05 2 50 1.292021 0.5710135 1.0217085
## 0.05 2 100 1.227254 0.5908100 0.9521033
## 0.05 2 200 1.190506 0.6086009 0.9128289
## 0.05 3 25 1.382928 0.5433680 1.1121320
## 0.05 3 50 1.271772 0.5715253 1.0053490
## 0.05 3 100 1.217050 0.5927125 0.9388424
## 0.05 3 200 1.193700 0.6039165 0.9100972
## 0.05 4 25 1.358562 0.5515938 1.0875376
## 0.05 4 50 1.256933 0.5753566 0.9839688
## 0.05 4 100 1.212496 0.5933630 0.9287913
## 0.05 4 200 1.190970 0.6053405 0.9028941
## 0.05 5 25 1.361084 0.5458255 1.0890699
## 0.05 5 50 1.257658 0.5758353 0.9827664
## 0.05 5 100 1.208479 0.5974218 0.9290219
## 0.05 5 200 1.183366 0.6117928 0.9004172
## 0.05 6 25 1.356040 0.5487619 1.0884287
## 0.05 6 50 1.251162 0.5766617 0.9819329
## 0.05 6 100 1.204640 0.5983575 0.9283533
## 0.05 6 200 1.184062 0.6100934 0.9039285
## 0.10 1 25 1.329125 0.5602768 1.0666733
## 0.10 1 50 1.256313 0.5813212 0.9910342
## 0.10 1 100 1.226950 0.5886404 0.9545939
## 0.10 1 200 1.213943 0.5924137 0.9324331
## 0.10 2 25 1.304996 0.5482027 1.0350321
## 0.10 2 50 1.249796 0.5703396 0.9734349
## 0.10 2 100 1.221744 0.5846132 0.9379748
## 0.10 2 200 1.207236 0.5928956 0.9181553
## 0.10 3 25 1.284438 0.5543862 1.0162224
## 0.10 3 50 1.242525 0.5726031 0.9663431
## 0.10 3 100 1.220125 0.5849813 0.9376313
## 0.10 3 200 1.205650 0.5946167 0.9221334
## 0.10 4 25 1.274906 0.5615247 1.0019898
## 0.10 4 50 1.230322 0.5782859 0.9477100
## 0.10 4 100 1.207437 0.5900834 0.9173836
## 0.10 4 200 1.203434 0.5924397 0.9086169
## 0.10 5 25 1.264447 0.5689347 0.9940740
## 0.10 5 50 1.220251 0.5885744 0.9449055
## 0.10 5 100 1.200152 0.6022066 0.9195133
## 0.10 5 200 1.192557 0.6068172 0.9103507
## 0.10 6 25 1.276518 0.5551852 0.9991284
## 0.10 6 50 1.229838 0.5791035 0.9501079
## 0.10 6 100 1.209216 0.5899568 0.9265784
## 0.10 6 200 1.200604 0.5946278 0.9161597
## 0.20 1 25 1.290425 0.5441367 1.0113265
## 0.20 1 50 1.268703 0.5519874 0.9848023
## 0.20 1 100 1.260119 0.5592280 0.9735457
## 0.20 1 200 1.258171 0.5621760 0.9641298
## 0.20 2 25 1.284798 0.5440384 1.0076286
## 0.20 2 50 1.273004 0.5531786 0.9822348
## 0.20 2 100 1.261320 0.5619601 0.9670267
## 0.20 2 200 1.260892 0.5642607 0.9614738
## 0.20 3 25 1.278338 0.5464713 0.9887245
## 0.20 3 50 1.248787 0.5660209 0.9528273
## 0.20 3 100 1.233801 0.5762532 0.9380424
## 0.20 3 200 1.232703 0.5772614 0.9354164
## 0.20 4 25 1.264492 0.5534458 0.9796558
## 0.20 4 50 1.241729 0.5640622 0.9516062
## 0.20 4 100 1.233250 0.5715214 0.9428358
## 0.20 4 200 1.233453 0.5718412 0.9417097
## 0.20 5 25 1.240226 0.5667739 0.9559535
## 0.20 5 50 1.228031 0.5756949 0.9381049
## 0.20 5 100 1.226838 0.5778495 0.9343807
## 0.20 5 200 1.227169 0.5781840 0.9329630
## 0.20 6 25 1.263097 0.5566761 0.9682854
## 0.20 6 50 1.251723 0.5640812 0.9549264
## 0.20 6 100 1.250546 0.5666356 0.9511011
## 0.20 6 200 1.250309 0.5673331 0.9484911
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 5, shrinkage = 0.05 and n.minobsinnode = 5.
Cubist
set.seed(300)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
neighbors = c(0, 1, 3, 5, 7))
cubist_model <- train(x = train_x, y = train_y,method = "cubist",
verbose = FALSE, metric = "Rsquared", tuneGrid = cubistGrid,trControl = ctrl, preProcess=pre_process)
cubist_model
## Cubist
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47), remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.783484 0.3447336 1.3008054
## 1 1 1.760482 0.3695971 1.2533727
## 1 3 1.749436 0.3673567 1.2596560
## 1 5 1.763659 0.3589877 1.2719841
## 1 7 1.774063 0.3518269 1.2839886
## 5 0 1.408501 0.4751252 1.0681905
## 5 1 1.370121 0.5084215 1.0178053
## 5 3 1.371397 0.5013727 1.0294790
## 5 5 1.384226 0.4938183 1.0376905
## 5 7 1.394223 0.4876719 1.0489456
## 10 0 1.304946 0.5319941 0.9931924
## 10 1 1.258686 0.5671439 0.9394749
## 10 3 1.264749 0.5595739 0.9533675
## 10 5 1.280093 0.5503698 0.9638110
## 10 7 1.290437 0.5436020 0.9749352
## 20 0 1.237286 0.5751213 0.9499917
## 20 1 1.192742 0.6059267 0.9006055
## 20 3 1.199731 0.5995782 0.9129920
## 20 5 1.214937 0.5910349 0.9254097
## 20 7 1.224631 0.5847037 0.9356405
## 50 0 1.185544 0.6091600 0.9103050
## 50 1 1.129865 0.6430393 0.8499082
## 50 3 1.142375 0.6350705 0.8664559
## 50 5 1.161150 0.6245744 0.8811305
## 50 7 1.172664 0.6176112 0.8928533
## 100 0 1.173871 0.6170264 0.8979396
## 100 1 1.116110 0.6513634 0.8336709
## 100 3 1.129138 0.6433284 0.8508466
## 100 5 1.148746 0.6325300 0.8665395
## 100 7 1.160761 0.6253304 0.8792325
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 100 and neighbors = 1.
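To substantiate the earlier claim that Cubist gives the lowest RMSE, and to obtain test-set numbers like the one quoted for the recursive partitioning model, a comparison along the following lines could be run; it assumes the four trained caret objects above are still in the workspace, and predict() on a train object re-applies the preprocessing (including the median imputation) to the test set.
# Compare resampling distributions across the four tree-based models
model_list <- list(rpart = rp_model, rf = rf_model, gbm = gb_model, cubist = cubist_model)
summary(resamples(model_list))
# Test-set performance (RMSE, Rsquared, MAE) for each model
sapply(model_list, function(m) postResample(predict(m, newdata = test_x), test_y))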
cubist_imp <- varImp(cubist_model, scale = FALSE)
plot(cubist_imp, top=15, scales = list(y = list(cex = 0.8)))
From the Cubist model, which was selected as the optimal model, ManufacturingProcess32 is the most important predictor, and several other manufacturing process variables appear near the top of the plot. Compared with the variable importance from the optimal linear and nonlinear models, ManufacturingProcess32 is still within the top five most important predictors.
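As a quick check on whether process or biological variables dominate, the short sketch below counts each type among the top 15 Cubist importances (the grepl pattern assumes the standard ManufacturingProcess/BiologicalMaterial column naming in this data set).
# Count process vs. biological variables among the top 15 predictors
top15 <- rownames(cubist_imp$importance)[order(cubist_imp$importance$Overall,
                                               decreasing = TRUE)][1:15]
table(ifelse(grepl("^Manufacturing", top15), "Process", "Biological"))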
Here using the "partykit" library
plot(as.party(rp_model$finalModel),gp=gpar(fontsize=10))
From the plot, ManufacturingProcess32 is the root node, and values above its split point lead to an increased predicted yield.
Manufacturing processes 32 and 13 are at the top of the tree, with higher values of process 32 associated with larger yields and lower values with smaller yields. However, a lower value of process 32 may be counteracted by a correspondingly lower value of manufacturing process 13.
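As a simple sanity check on this reading, the sketch below plots yield against ManufacturingProcess32 in the training data; a positive trend would support the interpretation of the root split.
# Visual check: yield vs. the root-split predictor in the training data
plot(train_x$ManufacturingProcess32, train_y,
     xlab = "ManufacturingProcess32", ylab = "Yield")
abline(lm(train_y ~ train_x$ManufacturingProcess32), col = "red")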