Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
No, the model did not use them: their importance scores are very low or negative compared to V1-V5.
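Before refitting, a predictor highly correlated with V1 was added to the data. The code for this step is not shown in the original; a likely sketch following the textbook's approach (the 0.1 noise scale is an assumption chosen to match the correlation reported below):
# Hypothetical reconstruction: add a predictor highly correlated with V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)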
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ .,
                       data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Yes, the importance score for V1 dropped (from about 8.7 to 5.7), because part of its importance is now shared with the highly correlated duplicate1 (about 4.3). Adding further predictors that are highly correlated with V1 would be expected to dilute V1's importance even more.
Use the cforest function from the party package to fit a random forest of conditional inference trees and compute the corresponding variable importance scores:
library(party)
cfModel <- cforest(y ~ ., data = simulated)
cfImp <- varimp(cfModel)
cfImp[order(-cfImp)]
## V4 V2 duplicate1 V1 V5
## 7.6223892727 6.0579730772 5.0941897280 4.6171158805 1.7161194047
## V7 V9 V3 V6 V10
## 0.0465374951 0.0046062409 0.0003116115 -0.0289427183 -0.0310326410
## V8
## -0.0380965511
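The second set of importance scores below appears to be the conditional permutation importance from party; a likely call (an assumption, since the original code is not shown):
cfImpCond <- varimp(cfModel, conditional = TRUE)
cfImpCond[order(-cfImpCond)]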
## V4 V2 duplicate1 V1 V5
## 6.175211816 4.806626676 2.039771619 1.894092913 1.028543061
## V6 V3 V9 V7 V10
## 0.021733840 0.020998486 0.003766844 -0.006026025 -0.015318278
## V8
## -0.022310557
The cforest importances show broadly the same pattern as the traditional random forest: the informative predictors (and duplicate1) dominate while V6-V10 remain near zero, and the conditional version further shrinks the scores of the correlated V1/duplicate1 pair.
Use a simulation to show tree bias with different granularities
The bias-variance tradeoff depends on the depth of the tree. Decision trees are sensitive to where and how they split, so even small changes in the input values can produce a very different tree structure. In addition, tree models suffer from selection bias: predictors with a larger number of distinct values are favored over more granular predictors (those with fewer distinct values). The simulation below generates four predictors of increasing granularity and a response that depends equally on V1 (the coarsest) and V4 (the finest).
#Predictors
set.seed(200)
V1 <- sample(0:10 / 10, 200, replace = T)
V2 <- sample(0:100 / 100, 200, replace = T)
V3 <- sample(0:1000 / 1000, 200, replace = T)
V4 <- sample(0:10000 / 10000, 200, replace = T)
# Target variable
y <- V1 + V4
simdata <- data.frame(y, V1,V2,V3,V4)
str(simdata)
## 'data.frame': 200 obs. of 5 variables:
## $ y : num 0.682 1.269 0.938 0.767 1.503 ...
## $ V1: num 0.5 0.6 0.6 0.7 0.7 0.9 0.7 0.1 0.5 0.2 ...
## $ V2: num 0.65 0.44 0.59 0.22 0.82 0.39 0.11 0.85 0.25 0.43 ...
## $ V3: num 0.851 0.673 0.41 0.033 0.717 0.646 0.768 0.153 0.285 0.781 ...
## $ V4: num 0.1816 0.6693 0.3381 0.0669 0.8033 ...
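The model that produced the importance scores below is not shown in the original. A minimal sketch, assuming a single CART tree fit with rpart and scored with caret's varImp:
library(rpart)
library(caret)
# Fit a single regression tree to the simulated data and inspect which
# predictors the tree favours
treeModel <- rpart(y ~ ., data = simdata)
varImp(treeModel)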
## Overall
## V1 4.2432046
## V2 0.6640285
## V3 0.5161309
## V4 4.3187553
Although V1 and V4 contribute equally to the response, V4 (the predictor with the most distinct values) receives a slightly higher importance score, and the uninformative but finely grained V2 and V3 still receive non-trivial scores, illustrating the selection bias toward predictors with many possible split points.
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
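No code for this comparison appears in the original. A minimal sketch of how the two extreme settings could be fit on the solubility data, assuming the solTrainXtrans and solTrainY objects from the AppliedPredictiveModeling package:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY, among others
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
# Both bagging fraction and learning rate set to 0.1 (left-hand plot)
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Both bagging fraction and learning rate set to 0.9 (right-hand plot)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow)   # relative influence = variable importance
summary(gbmHigh)
With a small learning rate and bagging fraction, importance tends to be spread across many predictors; with both parameters at 0.9, importance tends to concentrate on a handful of predictors.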
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
(A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.)
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
library(caret)
chmp <- ChemicalManufacturingProcess
predictors <- chmp[,-1]
# Correlations - Find most correlated predictors
corr <- cor(predictors, use='complete.obs')
topcorr <- findCorrelation(corr)
# Zero to Near-zero variance predictors check
nzv <- nearZeroVar(predictors)
# Final predictors to be considered for modeling (minus top correlated and near-zero variance)
predictors <- predictors[,-c(nzv, topcorr)]
yield <- as.data.frame(chmp[,1])
colnames(yield) <- c("yield")
# Splitting Train and Test datasets
library(caret)
set.seed(500)
train <- createDataPartition(yield$yield, p = 0.7, list = FALSE)
trainPredictors <- predictors[train,]
trainYield <- yield[train,]
testPredictors <- predictors[-train,]
testYield <- yield[-train,]
# Pre-processing: estimate the transformations on the training predictors only,
# then apply the same transformations to both training and test sets
transtrain <- preProcess(trainPredictors, method = c("BoxCox", "center", "scale", "knnImpute"))
transTrainPredictors <- predict(transtrain, trainPredictors)
transTestPredictors <- predict(transtrain, testPredictors)
# Single-Tree (CART)
library(rpart)
rpartModel <- train(x = transTrainPredictors,
                    y = trainYield,
                    method = "rpart",
                    tuneLength = 10,
                    control = rpart.control(maxdepth = 3))
rpartModel
## CART
##
## 124 samples
## 47 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.00000000 1.582253 0.3333315 1.242090
## 0.01546150 1.582442 0.3327918 1.242983
## 0.01637835 1.581506 0.3336337 1.242777
## 0.01790651 1.582902 0.3330980 1.244640
## 0.03399488 1.586807 0.3248441 1.255272
## 0.03837772 1.589153 0.3203555 1.261351
## 0.04825734 1.578079 0.3207319 1.251510
## 0.07824823 1.580359 0.3093754 1.263354
## 0.10791698 1.601078 0.2793171 1.280952
## 0.37106585 1.635747 0.2714240 1.344364
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.04825734.
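The bagged CART results below, and the later use of bagModel, imply a bagged-tree fit whose code is not shown; a likely sketch (caret's "treebag" method is an assumption):
# Bagged CART
bagModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "treebag")
bagModel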
## Bagged CART
##
## 124 samples
## 47 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.284615 0.5033861 1.020088
# Random Forest (conditional inference forest, via cforest)
rfModel <- train(x = transTrainPredictors,
                 y = trainYield,
                 method = "cforest",
                 tuneLength = 10)
#modelLookup("cforest")
rfModel
## Conditional Inference Random Forest
##
## 124 samples
## 47 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.525430 0.4373419 1.237979
## 7 1.367476 0.4854817 1.097711
## 12 1.341693 0.4906098 1.070490
## 17 1.333407 0.4909909 1.059084
## 22 1.333748 0.4863332 1.058292
## 27 1.333708 0.4836690 1.057417
## 32 1.336717 0.4789318 1.057632
## 37 1.339898 0.4767046 1.060028
## 42 1.347497 0.4696276 1.063856
## 47 1.358785 0.4614083 1.072335
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
# Boosted Trees (gbm)
boostgrid <- expand.grid(n.trees = c(50, 100, 150, 200),
                         interaction.depth = c(1, 5, 10, 15),
                         shrinkage = c(0.01, 0.1, 0.3, 0.5),
                         n.minobsinnode = c(5, 10, 15, 20))
gbmModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "gbm",
                  tuneGrid = boostgrid,
                  verbose = FALSE)
plot(gbmModel)
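The tuning summary below was presumably printed from the fitted object; a likely pair of calls (an assumption, since the original code is not shown):
gbmModel$bestTune
gbmModel$finalModel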
## n.trees interaction.depth shrinkage n.minobsinnode
## 100 200 10 0.1 5
## A gradient boosted model with gaussian loss function.
## 200 iterations were performed.
## There were 47 predictors of which 47 had non-zero influence.
resampl <- resamples(list(SingleTree=rpartModel, Bagging=bagModel, RandomForest=rfModel, Boosting=gbmModel))
summary(resampl)
##
## Call:
## summary.resamples(object = resampl)
##
## Models: SingleTree, Bagging, RandomForest, Boosting
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.9529453 1.1478482 1.2601471 1.2515102 1.359692 1.494096
## Bagging 0.7237368 0.9506854 1.0157216 1.0200884 1.104488 1.345302
## RandomForest 0.8370000 0.9666872 1.0268801 1.0590838 1.150407 1.329035
## Boosting 0.7704826 0.8605392 0.9155187 0.9640081 1.072086 1.214469
## NA's
## SingleTree 0
## Bagging 0
## RandomForest 0
## Boosting 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SingleTree 1.2601942 1.450081 1.579719 1.578079 1.716539 1.925961 0
## Bagging 0.8714201 1.209174 1.308471 1.284615 1.423152 1.613547 0
## RandomForest 1.0362395 1.239562 1.295276 1.333407 1.385481 1.672938 0
## Boosting 1.0453886 1.104140 1.223885 1.248895 1.386579 1.696649 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## SingleTree 0.09777539 0.2689006 0.3618896 0.3207319 0.4010188 0.5338808
## Bagging 0.31333004 0.4286663 0.5182934 0.5033861 0.5679770 0.6611859
## RandomForest 0.31770421 0.4195910 0.4916894 0.4909909 0.5520600 0.7201505
## Boosting 0.26777362 0.4882695 0.5552660 0.5524136 0.6183363 0.7478917
## NA's
## SingleTree 0
## Bagging 0
## RandomForest 0
## Boosting 0
rpartPred <- predict(rpartModel, newdata = transTestPredictors)
rpartperf <- postResample(pred = rpartPred, obs = testYield)
bagPred <- predict(bagModel, newdata = transTestPredictors)
bagperf <- postResample(pred = bagPred, obs = testYield)
rfPred <- predict(rfModel, newdata = transTestPredictors)
rfperf <- postResample(pred = rfPred, obs = testYield)
gbmPred <- predict(gbmModel, newdata = transTestPredictors)
gbmperf <- postResample(pred = gbmPred, obs = testYield)
library(knitr)
kable(data.frame("SingleTree"=rpartperf, "Bagging"=bagperf, "RandomForest"=rfperf, "Boosting"=gbmperf))
|          | SingleTree | Bagging  | RandomForest | Boosting  |
|----------|------------|----------|--------------|-----------|
| RMSE     | 1.5565189  | 1.336844 | 1.3732942    | 1.1489486 |
| Rsquared | 0.3846161  | 0.541170 | 0.5857616    | 0.6588613 |
| MAE      | 1.1476771  | 1.026317 | 1.0486082    | 0.9182695 |

On the test set, the boosted model gives the lowest RMSE and MAE and the highest Rsquared, making it the best-performing tree-based model here.
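The variable importance listing below for the boosting model presumably comes from caret's varImp; a likely call (an assumption):
gbmImp <- varImp(gbmModel)
gbmImp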
## gbm variable importance
##
## only 20 most important variables shown (out of 47)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17 22.987
## BiologicalMaterial06 18.743
## ManufacturingProcess06 17.645
## ManufacturingProcess13 12.974
## ManufacturingProcess11 11.390
## ManufacturingProcess30 10.446
## BiologicalMaterial01 8.057
## ManufacturingProcess09 7.777
## ManufacturingProcess25 7.196
## BiologicalMaterial08 7.188
## ManufacturingProcess22 7.010
## BiologicalMaterial05 6.820
## ManufacturingProcess01 6.638
## BiologicalMaterial11 6.634
## BiologicalMaterial09 6.571
## ManufacturingProcess21 6.186
## ManufacturingProcess39 5.759
## ManufacturingProcess28 5.330
## ManufacturingProcess24 4.621
ManufacturingProcess32 is by far the most important predictor in the boosting model, and manufacturing process variables dominate the top of the list (8 of the top 10), outranking the biological material measurements.
# Linear Model (PLS) - For Reference
ctrl <- trainControl(method = "boot", number = 25)
pls_tune <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "pls",
                  tuneLength = 15,
                  trControl = ctrl)
plsImp <- varImp(pls_tune, scale = FALSE)
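plsImp is computed above but never displayed; a small, hypothetical addition to view its top predictors for comparison with the tree-based importances:
plot(plsImp, top = 10)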
# Nonlinear Model (SVM) - For Reference
svmModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
svmImp <- varImp(svmModel)
plot(svmImp, top=10)
trainData <- data.frame(transTrainPredictors, trainYield)
# Compare average yield for batches with low vs. high values of the two most
# influential manufacturing process predictors (cut points on the transformed scale)
less <- subset(trainData, ManufacturingProcess32 < 0.02642 & ManufacturingProcess06 < 0.248)
more <- subset(trainData, ManufacturingProcess32 >= 0.02642 & ManufacturingProcess06 > 0.248)
mean(less$trainYield)
## [1] 39.04113
mean(more$trainYield)
## [1] 42.44591
Batches with higher values of ManufacturingProcess32 and ManufacturingProcess06 have a noticeably higher average yield (about 42.4 versus 39.0), suggesting that increasing these process settings is associated with improved yield.