Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest model dd not use the uninformative predictors significantly.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
These importance scores show almost similar patterns as the traditional random forest model. In the first model, V1 was considered the most important and V4 was the second most important. In the cforest model, these variables switch places.
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4## V1 V2 V3 V4 V5 V6
## 6.35767602 5.33898936 0.14934113 6.13948255 1.38546210 -0.22927142
## V7 V8 V9 V10
## -0.12173581 -0.18135897 -0.02920453 -0.20105653
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur
## var rel.inf
## V4 V4 29.8141647
## V2 V2 24.4292714
## V1 V1 16.0407649
## V5 V5 10.6369734
## duplicate1 duplicate1 9.8929397
## V3 V3 8.2372162
## duplicate2 duplicate2 0.4203948
## V6 V6 0.3658364
## V8 V8 0.1624385
## V7 V7 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
I still see the same pattern with V4, V2, duplicate1 and V1 being the important predictors while V6 - V9 are not the important predictors. I think the only difference between gbm model and previous ones are the importance got even more increased but the patterns is almost same. Now let’s test Cubist and see if the pattern persists here too.
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate2 0
Here seems like the pattern slightly changed as now V1, V2, V4, V5 and Duplicate1 are the top predictors while the others are not.
Use a simulation to show tree bias with different granularities
set.seed(624)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
rpart_m <- rpart(y ~., data=simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))## Overall
## a 1.702702
## b 4.769899
## c 3.551946
## d 3.205420
## e 3.957698
THe order of importance matches with last model as X1 - X4 are the important predictors using rpart. X1 is used more than X4 to split and hence we can say that there may be selection bias in the tree model. Overall, X1 is a highest important predictor as its value is significantly higher than X2-X4.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
## a Why does the model on the
right focus its importance on just the first few of predictors, whereas
the model on the left spreads importance across more predictors?
The bagging fraction is the fraction of the training data used, whereas, the learning rate is the fraction of the current predicted value that is added to the previous iteration’s predicted value. A lower learning rate is optimal, as it means there are more iterations. The model on the left has a smaller learning rate and bagging fraction, which means it learns slower and takes more computation time, thus performing better. It also uses a smaller subset of the data. The model on the right is most likely overfitting since it has a higher learning rate and bagging fraction. It uses more of the training data with each iteration and is learning faster. Since it is learning faster, it increases the weight or contribution of each predictor, hence focuses its importance on just the first few predictors. ## B Which model do you think would be more predictive of other samples?
The model on the left would be more predictive of other samples, as there are more iterations, thus decreasing the weight of each predictor. It generalizes better, making it more accurate.
Interaction depth is the number of splits to perform on a tree, or the maximum nodes per tree. When the interaction depth increases, the importance of the predictors increases, allowing the smaller important predictors to contribute more. Hence, the slope would become steeper or increase.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
cmp_impute <- preProcess(ChemicalManufacturingProcess[,-c(1)], method=c('bagImpute'))
# Replacing
cmp <- predict(cmp_impute, ChemicalManufacturingProcess[,-c(1)])
# Splitting the data into training and test datasets
set.seed(3559)
train_r <- createDataPartition(ChemicalManufacturingProcess$Yield, p=0.8, list=FALSE)
X_train <- cmp[train_r,]
y_train <- ChemicalManufacturingProcess$Yield[train_r]
X_test <- cmp[-train_r,]
y_test <- ChemicalManufacturingProcess$Yield[-train_r]set.seed(2330)
rpart_m <- train(x= X_train, y= y_train, method="rpart", tuneLength=10, control= rpart.control(maxdepth=2))## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## CART
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.008448771 1.531011 0.3633249 1.192349
## 0.026985533 1.531011 0.3633249 1.192349
## 0.028078955 1.531011 0.3633249 1.192349
## 0.028155890 1.531011 0.3633249 1.192349
## 0.031178734 1.531011 0.3633249 1.192349
## 0.041325211 1.537862 0.3582518 1.197905
## 0.053936668 1.536746 0.3578463 1.196370
## 0.071685404 1.534111 0.3596184 1.191464
## 0.077778984 1.540440 0.3524684 1.194357
## 0.402184913 1.708874 0.3075143 1.361557
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.03117873.
## Random Forest
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.357513 0.5865459 1.0841770
## 8 1.261666 0.6180566 0.9930581
## 14 1.246031 0.6195084 0.9752118
## 20 1.248092 0.6098999 0.9704285
## 26 1.250049 0.6023135 0.9667885
## 32 1.256488 0.5933266 0.9700237
## 38 1.263279 0.5860354 0.9709287
## 44 1.267245 0.5806719 0.9721596
## 50 1.281082 0.5697832 0.9806418
## 57 1.285406 0.5648538 0.9830251
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 14.
set.seed(19828)
grid <- expand.grid(n.trees=c(50, 100, 150, 200),
interaction.depth=c(1, 5, 10, 15),
shrinkage=c(0.01, 0.1, 0.5),
n.minobsinnode=c(5, 10, 15))
gbm_m <- train(x = X_train,y = y_train, method = 'gbm',tuneGrid = grid, verbose = FALSE)## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 7: BiologicalMaterial07 has no variation.
## n.trees interaction.depth shrinkage n.minobsinnode
## 76 200 10 0.1 5
set.seed(100)
cubistTuned <- train(X_train, y_train,
method = "cubist")
cubistPred <- predict(cubistTuned, X_train)
postResample(cubistPred, y_train)## RMSE Rsquared MAE
## 0.1778292 0.9924303 0.1350261
Which tree-based regression model gives the optimal resampling and test set performance
summary(resamples(list(Single_True = rpart_m, Random_Forest = randomf_m, Gradient_Boosting=gbm_m, Cubist = cubistTuned)))##
## Call:
## summary.resamples(object = resamples(list(Single_True = rpart_m,
## Random_Forest = randomf_m, Gradient_Boosting = gbm_m, Cubist = cubistTuned)))
##
## Models: Single_True, Random_Forest, Gradient_Boosting, Cubist
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_True 0.9444254 1.1065223 1.1844621 1.1923487 1.2964423 1.400361
## Random_Forest 0.6937116 0.9214488 0.9832082 0.9752118 1.0181030 1.215950
## Gradient_Boosting 0.7354389 0.8622051 0.9133852 0.9176087 0.9611315 1.120374
## Cubist 0.6964094 0.7935556 0.8785461 0.8803166 0.9345770 1.090198
## NA's
## Single_True 0
## Random_Forest 0
## Gradient_Boosting 0
## Cubist 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Single_True 1.2232769 1.404920 1.554384 1.531011 1.627351 1.797655 0
## Random_Forest 0.8990020 1.142196 1.263823 1.246031 1.340165 1.592363 0
## Gradient_Boosting 0.9808989 1.110621 1.168824 1.177305 1.220824 1.409576 0
## Cubist 0.8324293 1.075671 1.150489 1.157567 1.245419 1.492862 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Single_True 0.2166224 0.2787529 0.3634825 0.3633249 0.4349596 0.6043007
## Random_Forest 0.4117737 0.5772708 0.6265109 0.6195084 0.6742267 0.7068351
## Gradient_Boosting 0.5131900 0.5714078 0.6173689 0.6193288 0.6596906 0.7327710
## Cubist 0.4128364 0.5707274 0.6326029 0.6166150 0.6830906 0.7875603
## NA's
## Single_True 0
## Random_Forest 0
## Gradient_Boosting 0
## Cubist 0
Gradient Boosting is doing better as it’s RMSE value is better than the others.
Perform the test on the test set.
test_perf <- function(models, testData, testTarget) {
method <- c()
res <- data.frame()
for(model in models){
method <- c(method, model$method)
pred <- predict(model, newdata=testData)
res <- rbind(res, t(postResample(pred=pred, obs=testTarget)))
}
row.names(res) <- method
return(res)
}
models <- list(rpart_m, randomf_m, gbm_m, cubistTuned)
performance <- test_perf(models, X_test, y_test)
performance## RMSE Rsquared MAE
## rpart 1.3408087 0.3804309 1.1170494
## rf 0.9768228 0.6534475 0.7017081
## gbm 0.9865217 0.6505681 0.6990646
## cubist 0.9215937 0.7004253 0.7143488
Random forest performed slightly better than Gradient Boosting as it’s value dropped on test performance. Random Forest is consistent across the training and test dataset and I would choose Random forest.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 39.661
## BiologicalMaterial03 35.566
## BiologicalMaterial12 34.529
## BiologicalMaterial06 28.473
## ManufacturingProcess09 26.066
## ManufacturingProcess17 23.584
## ManufacturingProcess28 21.755
## ManufacturingProcess06 18.692
## ManufacturingProcess31 16.636
## BiologicalMaterial02 15.598
## BiologicalMaterial11 15.042
## BiologicalMaterial04 12.363
## BiologicalMaterial09 10.890
## ManufacturingProcess36 10.506
## BiologicalMaterial08 9.648
## BiologicalMaterial05 8.754
## BiologicalMaterial01 8.183
## ManufacturingProcess11 7.465
## ManufacturingProcess27 6.937
ManufacturingProcess32 is consistently an important predictor leading with BiologicalMaterial which is slightly consistent with previously tested linear and non-linear regression models. I would say none of them are dominating but they both are important predictors in the model.
Plot the optimal single true with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
## n= 144
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 144 510.06710 40.17806
## 2) ManufacturingProcess32< 159.5 81 137.14040 39.12543
## 4) BiologicalMaterial11< 145.47 44 50.63688 38.50932 *
## 5) BiologicalMaterial11>=145.47 37 49.93917 39.85811 *
## 3) ManufacturingProcess32>=159.5 63 167.78540 41.53143
## 6) ManufacturingProcess06< 208.1 37 81.27967 40.86622 *
## 7) ManufacturingProcess06>=208.1 26 46.83320 42.47808 *