Recreate the simulated data from Exercise 7.2
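The data-generation step is not shown here; a minimal sketch, following the simulation code given in Exercise 7.2 of the text (and loading the packages used in the calls below), would be:
library(mlbench)
library(randomForest)
library(vip)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)   # 10 predictors, only V1-V5 informative
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"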
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- model1$importance  # equivalently: varImp(model1, scale = FALSE)
vip(model1, color = 'red', fill='green') + ggtitle('Model1 Var Imp')
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
Based on the chart above, it appears that the random forest model did not make significant use of the uninformative predictors (V6-V10).
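The scores themselves support this; assuming the rfImp1 object computed above, the rows for the uninformative predictors can be inspected directly:
# Importance rows for the uninformative predictors V6-V10
rfImp1[paste0("V", 6:10), ]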
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
Adding a highly correlated variable changes the variable importance results. The score for V1 changed: in model1, V1 was the most important variable, but after adding the duplicate1 variable, which is highly correlated with V1, the importance score of V1 decreased. The scores for V6-V10 also changed, increasing slightly in importance.
library(caret)
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
vip(model2, color = 'green', fill='red') + ggtitle('Model2 Var Imp')
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
When variable importance is conditional, the calculation accounts for the correlation between V1 and duplicate1 and adjusts the importance scores accordingly. With the unconditional approach, V1 and duplicate1 are treated as similarly important. The resulting patterns do not appear to match those of the traditional random forest model.
library(partykit)
model3 <- cforest(y ~ ., data = simulated, ntree = 1000)
# Conditional variable importance
cfImp_cond <- varimp(model3, conditional = TRUE)
# Un-conditional variable importance
cfImp_uncond <- varimp(model3, conditional = FALSE)
Conditional
barplot(sort(cfImp_cond), horiz = TRUE, main = 'Conditional')
Unconditional
barplot(sort(cfImp_uncond), horiz = TRUE, main = 'Unconditional')
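To see the two measures side by side (a quick comparison using the vectors computed above):
# Compare conditional vs. unconditional importance for each predictor
round(cbind(conditional = cfImp_cond, unconditional = cfImp_uncond), 3)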
library(Cubist)
library(caret)
model4 <- cubist(x = simulated[, names(simulated) != 'y'], y = simulated$y)
# 'Conditional' importance call (varImp for a Cubist model ignores the conditional argument)
cImp_cond <- varImp(model4, conditional = TRUE)
# 'Unconditional' importance call
cImp_uncond <- varImp(model4, conditional = FALSE)
For Cubist, the conditional and unconditional patterns are identical: the Cubist importance method does not use the conditional argument, so both calls return the same scores.
barplot(t(cImp_cond), horiz = TRUE, main = 'Conditional')
barplot(t(cImp_uncond), horiz = TRUE, main = 'Unconditional')
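Since the conditional argument is not used here, the two sets of scores can be confirmed to be identical:
# The two varImp calls should return the same scores
all.equal(cImp_cond, cImp_uncond)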
Use a simulation to show tree bias with different granularities.
V1 <- runif(1000, 2,1000)
V2 <- runif(1000, 50,500)
V3 <- rnorm(1000, 500,10)
y <- V2 + V1
df <- data.frame(V1, V2, V3, y)
model5 <- cforest(y ~ ., data = df, ntree = 10)
# Unconditional variable importance
cfImp_uncond <- varimp(model5, conditional = FALSE)
Because V1 takes values over the widest range (and therefore offers the most distinct candidate split points), the random forest gives it the highest score. V2 scores higher than V3 because its values are more spread out than those of V3, which are tightly concentrated around 500.
barplot(sort(cfImp_uncond), horiz = TRUE, main = 'Unconditional')
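A quick look at the spread of each predictor (a simple check using the df simulated above) shows why the splits favor V1 and V2:
# Range (max - min) of each predictor: a wider range means more candidate split points
sapply(df[, c("V1", "V2", "V3")], function(x) diff(range(x)))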
Figure 8.24
In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9.
LEFT = {bagging fraction = 0.1, learning rate = 0.1} | RIGHT = {bagging fraction = 0.9, learning rate = 0.9}
The model on the right has a higher bagging fraction and learning rate. A higher bagging fraction means that a larger fraction of the training data is used to construct each tree. As the bagging fraction approaches 1, the bootstrap samples become more similar to each other, which leads to a few predictors dominating the importance scores.
The model on the left has a lower bagging fraction and learning rate. The lower the learning rate, the less greedy the model is, which makes it more likely to identify a larger set of predictors as important.
The model on the left, with the lower bagging fraction and lower learning rate, would likely have better performance than the model on the right.
As we increase the interaction depth (tree depth), variable importance is more likely to be spread out over more predictors, since more and more predictors are considered for tree splitting.
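The effect of these settings could be checked directly on the solubility data. The sketch below is one hedged way to do so, assuming the solTrainXtrans/solTrainY objects from the AppliedPredictiveModeling package; it fits two gbm models at the two parameter extremes and prints their relative influence (interaction.depth can be increased to see importance spread over more predictors):
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# Left-hand panel: bagging fraction = 0.1, learning rate = 0.1
gbm_low <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 3,
               shrinkage = 0.1, bag.fraction = 0.1)

# Right-hand panel: bagging fraction = 0.9, learning rate = 0.9
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 3,
                shrinkage = 0.9, bag.fraction = 0.9)

# Relative influence: the high-shrinkage/high-bag-fraction fit concentrates
# importance on fewer predictors
head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)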
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
Split train/test.
cmp_predictors <- as.matrix(ChemicalManufacturingProcess[, 2:58])
cmp_yield <- ChemicalManufacturingProcess[, 1]
set.seed(100)
train_select <- createDataPartition(cmp_yield, p = 0.75, list = FALSE)  # training set indices
train_x <- ChemicalManufacturingProcess[train_select,-1]
train_y <- ChemicalManufacturingProcess[train_select,1]
test_x <- ChemicalManufacturingProcess[-train_select,-1]
test_y <- ChemicalManufacturingProcess[-train_select,1]
pre_process <- c("nzv", "corr", "center","scale", "medianImpute")
set.seed(200)
ctrl <- trainControl(method = "boot", number = 25)
Based on the resampled RMSE results below, the Cubist model has the lowest RMSE.
Recursive Partitioning
set.seed(300)
rpartGrid <- expand.grid(maxdepth= seq(1,10,by=1))
rp_model <- train(x = train_x, y = train_y, method = "rpart2", metric = "Rsquared",
                  tuneGrid = rpartGrid, trControl = ctrl, preProcess = pre_process)
Random Forest
set.seed(300)
rfGrid <- expand.grid(mtry=seq(2,38,by=3))
rf_model <- train(x = train_x, y = train_y, method = "rf", tuneGrid = rfGrid, metric = "Rsquared", importance = TRUE,
trControl = ctrl,preProcess=pre_process)
Generalized Boosted Regression
library(gbm)
## Loaded gbm 2.1.5
set.seed(300)
gbmGrid <- expand.grid(interaction.depth=seq(1,6,by=1),
n.trees=c(25,50,100,200),
shrinkage=c(0.01,0.05,0.1,0.2),
n.minobsinnode=5)
gb_model <- train(x = train_x, y = train_y, method = "gbm", metric = "Rsquared", verbose = FALSE,
                  tuneGrid = gbmGrid, trControl = ctrl, preProcess = pre_process)
Cubist
set.seed(300)
cubistGrid <- expand.grid(committees = c(1, 5, 10, 20, 50, 100),
neighbors = c(0, 1, 3, 5, 7))
cubist_model <- train(x = train_x, y = train_y, method = "cubist", verbose = FALSE,
                      metric = "Rsquared", tuneGrid = cubistGrid, trControl = ctrl, preProcess = pre_process)
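With all four models fit, their resampling profiles can be collected in one place (a quick comparison, assuming the four train objects above):
# Side-by-side bootstrap RMSE distributions for the four models
resamps <- resamples(list(rpart = rp_model, rf = rf_model,
                          gbm = gb_model, cubist = cubist_model))
summary(resamps)$statistics$RMSE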
Recursive Partitioning: RMSE 1.496662
rp_model
## CART
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47),
## remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared MAE
## 1 1.446007 0.4220720 1.143513
## 2 1.502728 0.3892350 1.192220
## 3 1.494603 0.4095253 1.180323
## 4 1.494340 0.4239887 1.182846
## 5 1.496662 0.4248578 1.181303
## 6 1.520564 0.4165578 1.191895
## 7 1.520294 0.4229997 1.186132
## 8 1.525088 0.4228835 1.191509
## 9 1.530088 0.4196601 1.193589
## 10 1.538378 0.4155667 1.199175
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 5.
Random Forest: RMSE 1.200401
rf_model
## Random Forest
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47),
## remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.320376 0.5691294 1.0436938
## 5 1.242293 0.6013719 0.9720371
## 8 1.218744 0.6081985 0.9513966
## 11 1.209857 0.6090369 0.9398972
## 14 1.200401 0.6124067 0.9299892
## 17 1.199905 0.6092975 0.9313788
## 20 1.194800 0.6104796 0.9261917
## 23 1.198033 0.6077147 0.9287821
## 26 1.199683 0.6038839 0.9295573
## 29 1.200444 0.6031840 0.9290230
## 32 1.200941 0.6018622 0.9276243
## 35 1.211061 0.5935499 0.9362114
## 38 1.210580 0.5940910 0.9371074
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
Generalized Boosting: RMSE 1.194412
gb_model
## Stochastic Gradient Boosting
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47),
## remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees RMSE Rsquared MAE
## 0.01 1 25 1.713421 0.4819946 1.3571651
## 0.01 1 50 1.607853 0.4972286 1.2674166
## 0.01 1 100 1.476911 0.5221612 1.1654662
## 0.01 1 200 1.351535 0.5431167 1.0657285
## 0.01 2 25 1.689819 0.4995401 1.3406533
## 0.01 2 50 1.562149 0.5209369 1.2328571
## 0.01 2 100 1.413837 0.5388320 1.1145509
## 0.01 2 200 1.298976 0.5553711 1.0192666
## 0.01 3 25 1.680068 0.5097741 1.3337627
## 0.01 3 50 1.544032 0.5332486 1.2208632
## 0.01 3 100 1.391206 0.5444963 1.0940655
## 0.01 3 200 1.277749 0.5639034 1.0022225
## 0.01 4 25 1.672308 0.5243584 1.3282058
## 0.01 4 50 1.533669 0.5367465 1.2134469
## 0.01 4 100 1.376582 0.5529506 1.0852364
## 0.01 4 200 1.264614 0.5710770 0.9931650
## 0.01 5 25 1.669260 0.5257313 1.3273246
## 0.01 5 50 1.531100 0.5388978 1.2129025
## 0.01 5 100 1.372559 0.5535882 1.0834247
## 0.01 5 200 1.260669 0.5723870 0.9894489
## 0.01 6 25 1.663472 0.5355138 1.3231436
## 0.01 6 50 1.521538 0.5441659 1.2071121
## 0.01 6 100 1.362701 0.5587429 1.0757300
## 0.01 6 200 1.250047 0.5794039 0.9794532
## 0.05 1 25 1.431720 0.5212763 1.1296549
## 0.05 1 50 1.325706 0.5373127 1.0440116
## 0.05 1 100 1.277340 0.5492961 1.0003337
## 0.05 1 200 1.271666 0.5535652 0.9887431
## 0.05 2 25 1.381431 0.5312744 1.0855944
## 0.05 2 50 1.288525 0.5498220 1.0096878
## 0.05 2 100 1.252644 0.5648314 0.9756844
## 0.05 2 200 1.227951 0.5814545 0.9544170
## 0.05 3 25 1.340778 0.5494681 1.0566491
## 0.05 3 50 1.261921 0.5620340 0.9899643
## 0.05 3 100 1.232201 0.5791961 0.9618919
## 0.05 3 200 1.208345 0.5949999 0.9365272
## 0.05 4 25 1.335872 0.5477303 1.0539631
## 0.05 4 50 1.259873 0.5647446 0.9876622
## 0.05 4 100 1.225381 0.5845456 0.9538416
## 0.05 4 200 1.204639 0.5988190 0.9314796
## 0.05 5 25 1.333420 0.5433711 1.0533135
## 0.05 5 50 1.250932 0.5664684 0.9791900
## 0.05 5 100 1.216350 0.5863053 0.9429517
## 0.05 5 200 1.194412 0.6015103 0.9212233
## 0.05 6 25 1.326120 0.5489203 1.0478633
## 0.05 6 50 1.241881 0.5743299 0.9720595
## 0.05 6 100 1.211085 0.5908748 0.9388077
## 0.05 6 200 1.196340 0.5998556 0.9215901
## 0.10 1 25 1.337647 0.5254716 1.0498406
## 0.10 1 50 1.287789 0.5397814 1.0055440
## 0.10 1 100 1.281903 0.5450445 0.9991686
## 0.10 1 200 1.283337 0.5469108 1.0001891
## 0.10 2 25 1.290522 0.5408344 1.0120448
## 0.10 2 50 1.258719 0.5590678 0.9848854
## 0.10 2 100 1.237131 0.5752840 0.9669757
## 0.10 2 200 1.228188 0.5826838 0.9528319
## 0.10 3 25 1.292146 0.5375782 1.0147189
## 0.10 3 50 1.264751 0.5559729 0.9902035
## 0.10 3 100 1.244973 0.5722418 0.9674498
## 0.10 3 200 1.233130 0.5806102 0.9541416
## 0.10 4 25 1.275248 0.5464811 1.0032062
## 0.10 4 50 1.244531 0.5671624 0.9713187
## 0.10 4 100 1.228697 0.5793449 0.9551650
## 0.10 4 200 1.219937 0.5850983 0.9465322
## 0.10 5 25 1.250406 0.5620594 0.9787226
## 0.10 5 50 1.222669 0.5794037 0.9472771
## 0.10 5 100 1.205240 0.5928691 0.9256528
## 0.10 5 200 1.197315 0.5983532 0.9176540
## 0.10 6 25 1.253090 0.5644710 0.9775822
## 0.10 6 50 1.220784 0.5856716 0.9449732
## 0.10 6 100 1.207360 0.5948927 0.9307187
## 0.10 6 200 1.199964 0.5989132 0.9247113
## 0.20 1 25 1.302057 0.5301085 1.0234120
## 0.20 1 50 1.312754 0.5214456 1.0251724
## 0.20 1 100 1.324777 0.5194423 1.0283380
## 0.20 1 200 1.324010 0.5238115 1.0246157
## 0.20 2 25 1.289058 0.5401963 1.0043093
## 0.20 2 50 1.269608 0.5617679 0.9786454
## 0.20 2 100 1.257876 0.5711676 0.9711393
## 0.20 2 200 1.254104 0.5748248 0.9692057
## 0.20 3 25 1.291760 0.5375625 0.9966757
## 0.20 3 50 1.277594 0.5509380 0.9814454
## 0.20 3 100 1.263415 0.5604979 0.9723079
## 0.20 3 200 1.258556 0.5645535 0.9659650
## 0.20 4 25 1.289691 0.5342971 0.9943173
## 0.20 4 50 1.270047 0.5501486 0.9774601
## 0.20 4 100 1.264751 0.5552468 0.9731044
## 0.20 4 200 1.262690 0.5575242 0.9717571
## 0.20 5 25 1.275473 0.5481703 0.9926977
## 0.20 5 50 1.261985 0.5573279 0.9781802
## 0.20 5 100 1.258264 0.5607697 0.9748566
## 0.20 5 200 1.255744 0.5632469 0.9727153
## 0.20 6 25 1.247444 0.5684286 0.9682643
## 0.20 6 50 1.231271 0.5802438 0.9573174
## 0.20 6 100 1.227512 0.5829011 0.9525557
## 0.20 6 200 1.227286 0.5832301 0.9520439
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200,
## interaction.depth = 5, shrinkage = 0.05 and n.minobsinnode = 5.
Cubist: RMSE 1.123671
cubist_model
## Cubist
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (47), scaled (47), median imputation (47),
## remove (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.900356 0.3391244 1.3774697
## 1 1 1.852243 0.3783705 1.3491538
## 1 3 1.868236 0.3621745 1.3523916
## 1 5 1.876522 0.3557701 1.3553085
## 1 7 1.877604 0.3526216 1.3568618
## 5 0 1.360823 0.5274541 1.0542433
## 5 1 1.316455 0.5636903 1.0076885
## 5 3 1.326182 0.5525188 1.0238896
## 5 5 1.333516 0.5471739 1.0342985
## 5 7 1.339091 0.5435583 1.0401352
## 10 0 1.299510 0.5541688 0.9979394
## 10 1 1.248361 0.5897690 0.9448069
## 10 3 1.263809 0.5777453 0.9637459
## 10 5 1.273117 0.5719587 0.9741553
## 10 7 1.279562 0.5681643 0.9799410
## 20 0 1.247859 0.5784621 0.9687138
## 20 1 1.187194 0.6178969 0.9038855
## 20 3 1.204245 0.6063998 0.9282984
## 20 5 1.218000 0.5979068 0.9424389
## 20 7 1.224935 0.5936971 0.9479289
## 50 0 1.188141 0.6117330 0.9181411
## 50 1 1.128215 0.6481615 0.8572959
## 50 3 1.144807 0.6390132 0.8771819
## 50 5 1.158805 0.6302804 0.8914944
## 50 7 1.165503 0.6262431 0.8962858
## 100 0 1.181725 0.6160486 0.9109076
## 100 1 1.123671 0.6505370 0.8493752
## 100 3 1.139919 0.6421493 0.8696818
## 100 5 1.153927 0.6335253 0.8835815
## 100 7 1.160370 0.6297168 0.8904697
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were committees = 100 and neighbors
## = 1.
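The headings above report the bootstrap RMSE of each selected model. A test-set check of the best model (Cubist) could look like the following sketch, using the held-out test_x/test_y created earlier:
# Predict on the held-out test set and compute RMSE / R-squared / MAE
cubist_pred <- predict(cubist_model, newdata = test_x)
postResample(pred = cubist_pred, obs = test_y)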
The top 10 most important predictors according to the Cubist model are ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, BiologicalMaterial06, BiologicalMaterial03, BiologicalMaterial11, ManufacturingProcess33, ManufacturingProcess39, and BiologicalMaterial09.
cubist_imp <- varImp(cubist_model, scale = FALSE)
plot(cubist_imp, top=15, scales = list(y = list(cex = 0.8)))
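A quick tally of the top ten (assuming the cubist_imp object above) shows that process variables dominate the list (six of ten):
# Count Biological vs. Manufacturing predictors among the ten most important
imp_df <- cubist_imp$importance
top10 <- rownames(imp_df)[order(imp_df$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))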
As expected, the single tree splits the data at the very top on the most important variable, ManufacturingProcess32.
plot(as.party(rp_model$finalModel), gp = gpar(fontsize = 10))