library(tidyverse)
library(mlbench)
library(caret)
library(Cubist)
library(gbm)
library(ipred)
library(party)
library(partykit)
library(randomForest)
library(rpart)
library(RWeka)
library(AppliedPredictiveModeling)
library(rattle)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6–V10)?
No. Their importance scores are extremely small.
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
The importance score for V1 fell, and so did the scores for most other variables. Multicollinearity usually affects the ranking of variable importance: when two predictors carry essentially the same information, the splits (and hence the importance) are shared between them, making each look less important. V4 is now the most important predictor in this case.
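As an added check (not in the original write-up), a second near-copy of V1 can be introduced to show the importance diluting further across the correlated set:
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1   # another near-copy of V1
model2b <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
varImp(model2b, scale = FALSE)   # V1's score typically drops further, now shared three ways
simulated$duplicate2 <- NULL     # remove it so the later fits match the write-up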
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model2 <- party::cforest(y ~ ., data = simulated,
                         controls = cforest_control(mtry = ncol(simulated) - 1))
party::varimp(model2)
## V1 V2 V3 V4 V5 V6
## 3.887369983 7.744416028 0.050926368 10.241797325 2.064600935 -0.018782373
## V7 V8 V9 V10 duplicate1
## 0.029094384 -0.023311385 0.009101122 0.026801558 5.756908601
With conditional = FALSE, the importance scores follow the same pattern as the traditional random forest.
party::varimp(model2, conditional = TRUE)
## V1 V2 V3 V4 V5 V6
## 0.555456266 4.725776440 0.007057617 6.054858562 0.698371838 -0.008996735
## V7 V8 V9 V10 duplicate1
## 0.015960555 -0.008461915 0.005040511 0.023136037 0.864870599
With conditional = TRUE, the scores are even smaller, although some of the less significant predictors did improve a bit.
On the whole, V1–V5 remain the most important in both cases, so they do follow the pattern of the traditional random forest.
Boosted trees
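The relative-influence table below comes from a boosted tree fit whose code is not in this extract; a minimal sketch of how it could be produced (the gaussian distribution and n.trees = 1000 are assumptions):
gbm_model <- gbm(y ~ ., data = simulated,
                 distribution = "gaussian",  # assumed: squared-error loss
                 n.trees = 1000)             # assumed tree count
summary(gbm_model, plotit = FALSE)           # relative influence of each predictor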
## var rel.inf
## V4 V4 28.1586654
## V2 V2 21.9983634
## duplicate1 duplicate1 15.8249555
## V1 V1 15.1558423
## V5 V5 9.9637365
## V3 V3 8.0879655
## V6 V6 0.8104713
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
Cubist
cubist_model <- cubist(simulated[, c(1:10, 12)], simulated$y, committees = 100)
varImp(cubist_model)
## Overall
## V3 43.5
## V1 52.5
## V2 59.5
## duplicate1 27.5
## V4 46.0
## V8 4.0
## V5 27.0
## V6 10.0
## V10 1.0
## V7 0.0
## V9 0.0
Only Cubist ranks V2 as the most important variable. V6, which was uninformative in the previous models, is given a nontrivial importance score by the Cubist model.
Use a simulation to show tree bias with different granularities.
set.seed(300)
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)
set.seed(300)
zy <- (.4 * x1) + (.2 * x2) + (.1 * x3) + rnorm(300, 0, sqrt(1- (.16 + .04 + .01)))
y <- (1.5 * zy) + 10
simulated2 <- data.frame(x1 = x1, x2 = x2, x3 = x3, y=y)
rpartfit <- rpart(y ~., data = simulated2)
varImp(rpartfit)
## Overall
## x1 3.556750
## x2 1.823912
## x3 1.755870
According to the text (p. 182), regression trees suffer from selection bias: predictors with a higher number of distinct values, here those with lower variance, are favored over more granular predictors with higher variance. In this simulation, x1 has the smallest standard deviation, so it is ranked as the strongest predictor.
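To probe the distinct-values claim directly, here is an added sketch (not part of the original simulation): the response gets its signal only from a coarse two-valued predictor, while a fine-grained noise predictor competes for splits:
set.seed(302)
coarse <- sample(1:2, 300, replace = TRUE)  # informative, only 2 distinct values
fine <- rnorm(300)                          # pure noise, ~300 distinct values
ybias <- coarse + rnorm(300)                # signal comes only from `coarse`
biasfit <- rpart(ybias ~ coarse + fine)
varImp(biasfit)  # any importance credited to `fine` is spurious (selection bias)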
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Bagging fraction parameter - the fraction of observations randomly sampled from the training set for each tree.
Shrinkage parameter - also known as the learning rate; the larger the value, the faster the learning.
A higher shrinkage parameter means larger steps down the gradient, so the algorithm converges faster but may step past the optimum and eventually overfit.
The learning rate of 0.1 is better since it is slower and the importance spreads out over more predictors. Small incremental steps down the gradient appear to work best.
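As an added sketch (not the original analysis, which explores interaction depth below), the two extremes of Figure 8.24 could be reproduced by tuning shrinkage through the grid and passing bag.fraction through train's ... argument to gbm:
data(solubility)  # provides solTrainXtrans and solTrainY
set.seed(100)
gbmLow <- train(solTrainXtrans, solTrainY, method = "gbm", verbose = FALSE,
                tuneGrid = expand.grid(.interaction.depth = 1, .n.trees = 100,
                                       .shrinkage = 0.1, .n.minobsinnode = 10),
                bag.fraction = 0.1)   # both knobs at the low extreme
set.seed(100)
gbmHigh <- train(solTrainXtrans, solTrainY, method = "gbm", verbose = FALSE,
                 tuneGrid = expand.grid(.interaction.depth = 1, .n.trees = 100,
                                        .shrinkage = 0.9, .n.minobsinnode = 10),
                 bag.fraction = 0.9)  # both knobs at the high extreme
plot(varImp(gbmLow), top = 30)   # importance spread over many predictors
plot(varImp(gbmHigh), top = 30)  # importance concentrated on a few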
data(solubility)
gbmGrid1 <- expand.grid(.interaction.depth = 1,
.n.trees = 100,
.shrinkage = 0.1,
.n.minobsinnode=10)
set.seed(100)
gbmTune1 <- train(solTrainXtrans,
solTrainY,
method = "gbm",
tuneGrid = gbmGrid1,
verbose = FALSE)
plot(varImp(gbmTune1), top = 30)
gbmGrid2 <- expand.grid(.interaction.depth = 20,
.n.trees = 100,
.shrinkage = 0.1,
.n.minobsinnode=10)
set.seed(100)
gbmTune2 <- train(solTrainXtrans,
solTrainY,
method = "gbm",
tuneGrid = gbmGrid2,
verbose = FALSE)
plot(varImp(gbmTune2), top = 30)
Looking at the two plots, we can see that increasing the interaction.depth parameter helps to spread the importance out over more of the predictors.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1], method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
yield <- as.matrix(ChemicalManufacturingProcess$Yield)
set.seed(789)
split2 <- yield %>%
createDataPartition(p = 0.8, list = FALSE, times = 1)
Xtrain.data <- Xpreprocess[split2, ] #chem train
xtest.data <- Xpreprocess[-split2, ] #chem test
Ytrain.data <- yield[split2, ] #yield train
ytest.data <- yield[-split2, ] #yield test
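The tuned model objects used below are not shown in this extract. A minimal sketch of how the six fits could be produced (the caret method strings are standard, but the grids and controls are assumptions; caret's default bootstrap matches the 25-rep resampling printed later):
set.seed(100)
chem_rpart_tuned <- train(Xtrain.data, Ytrain.data, method = "rpart", tuneLength = 10)
set.seed(100)
chem_m5_tuned <- train(Xtrain.data, Ytrain.data, method = "M5")          # RWeka model tree
set.seed(100)
chem_bagged_tuned <- train(Xtrain.data, Ytrain.data, method = "treebag") # bagged CART
set.seed(100)
chem_rf_tuned <- train(Xtrain.data, Ytrain.data, method = "rf")
set.seed(100)
chem_gbm_tuned <- train(Xtrain.data, Ytrain.data, method = "gbm", verbose = FALSE)
set.seed(100)
chem_cubist_tuned <- train(Xtrain.data, Ytrain.data, method = "cubist")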
RMSE = c(min(chem_rpart_tuned$results$RMSE), min(chem_m5_tuned$results$RMSE), min(chem_bagged_tuned$results$RMSE),min(chem_rf_tuned$results$RMSE),min(chem_gbm_tuned$results$RMSE), min(chem_cubist_tuned$results$RMSE))
Rsquared = c(max(chem_rpart_tuned$results$Rsquared), max(chem_m5_tuned$results$Rsquared), max(chem_bagged_tuned$results$Rsquared), max(chem_rf_tuned$results$Rsquared), max(chem_gbm_tuned$results$Rsquared), max(chem_cubist_tuned$results$Rsquared))
MAE = c(min(chem_rpart_tuned$results$MAE), min(chem_m5_tuned$results$MAE), min(chem_bagged_tuned$results$MAE), min(chem_rf_tuned$results$MAE), min(chem_gbm_tuned$results$MAE), min(chem_cubist_tuned$results$MAE))
results <- cbind(RMSE, Rsquared, MAE) %>% data.frame(row.names = c("RPART", "M5", "BAG", "RF", "GBM", "CUBIST"))
kableExtra::kable(results) %>% kableExtra::kable_styling(bootstrap_options = "striped")
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| RPART | 1.334645 | 0.5054817 | 1.0618190 |
| M5 | 1.398013 | 0.4903795 | 1.0569324 |
| BAG | 1.192763 | 0.6121492 | 0.9418477 |
| RF | 1.126824 | 0.6742688 | 0.8929892 |
| GBM | 1.264458 | 0.5518754 | 0.9923112 |
| CUBIST | 1.100763 | 0.6565921 | 0.8554157 |
rpart_pred <- predict(chem_rpart_tuned, newdata = xtest.data)
rpartpv <- postResample(pred = rpart_pred, obs = ytest.data)
m5_pred <- predict(chem_m5_tuned, newdata = xtest.data)
m5pv <- postResample(pred = m5_pred, obs = ytest.data)
bagged_pred <- predict(chem_bagged_tuned, newdata = xtest.data)
baggedpv <- postResample(pred = bagged_pred, obs = ytest.data)
rf_pred <- predict(chem_rf_tuned, newdata = xtest.data)
rfpv <- postResample(pred = rf_pred, obs = ytest.data)
gbm_pred <- predict(chem_gbm_tuned, newdata = xtest.data)
gbmpv <- postResample(pred = gbm_pred, obs = ytest.data)
cubist_pred <- predict(chem_cubist_tuned, newdata = xtest.data)
cubistpv <- postResample(pred = cubist_pred, obs = ytest.data)
data.frame(rpartpv, m5pv, baggedpv, rfpv, gbmpv, cubistpv) %>% kableExtra::kable() %>% kableExtra::kable_styling(bootstrap_options = "striped")
| | rpartpv | m5pv | baggedpv | rfpv | gbmpv | cubistpv |
|---|---|---|---|---|---|---|
| RMSE | 1.4975873 | 1.2776163 | 1.2846622 | 1.0746261 | 1.2134944 | 0.9541025 |
| Rsquared | 0.2261018 | 0.4002705 | 0.4053841 | 0.5644691 | 0.4802251 | 0.6613366 |
| MAE | 1.1315189 | 0.9872927 | 0.9218436 | 0.8051689 | 0.8915011 | 0.7247069 |
Which tree-based regression model gives the optimal resampling and test set performance?
From the tables above, the Cubist model outperforms the other models on both the resampling and the test set, as it has the lowest RMSE.
chem_cubist_tuned
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.654406 0.4227947 1.1951170
## 1 5 1.630817 0.4359900 1.1694179
## 1 9 1.639838 0.4320087 1.1753534
## 10 0 1.206217 0.5960610 0.9349341
## 10 5 1.180251 0.6128340 0.9062233
## 10 9 1.191380 0.6057719 0.9181023
## 20 0 1.126936 0.6401021 0.8842087
## 20 5 1.100763 0.6565921 0.8554157
## 20 9 1.112568 0.6494242 0.8663721
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
chem_cubist_tuned$bestTune
## committees neighbors
## 8 20 5
chem_cubist_tuned$finalModel
##
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
##
## Number of samples: 144
## Number of predictors: 56
##
## Number of committees: 20
## Number of rules per committee: 2, 1, 1, 1, 7, 1, 6, 1, 6, 1, 4, 1, 1, 1, 7, 1, 3, 1, 4, 1
[plotmo plot of the final Cubist model rendered here; the printed "plotmo grid" of predictor medians is omitted]
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(chem_cubist_tuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 94.667
## ManufacturingProcess09 64.000
## BiologicalMaterial03 60.000
## BiologicalMaterial02 50.667
## ManufacturingProcess13 46.667
## ManufacturingProcess04 42.667
## ManufacturingProcess33 42.667
## BiologicalMaterial12 42.667
## ManufacturingProcess39 34.667
## BiologicalMaterial06 34.667
## ManufacturingProcess15 26.667
## ManufacturingProcess29 25.333
## ManufacturingProcess37 20.000
## ManufacturingProcess02 18.667
## BiologicalMaterial08 18.667
## ManufacturingProcess26 14.667
## ManufacturingProcess27 14.667
## ManufacturingProcess31 13.333
## BiologicalMaterial09 9.333
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(101)
ridgeRegFit <- train(Xtrain.data, Ytrain.data, method = "ridge", tuneGrid = ridgeGrid, trControl = trainControl(method = "cv", number = 10))
predictions <- ridgeRegFit %>% predict(xtest.data)
cbind(
RMSE = RMSE(predictions, ytest.data),
R_squared = caret::R2(predictions, ytest.data)
)
## RMSE R_squared
## [1,] 1.06489 0.5657402
varImp(ridgeRegFit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 97.82
## ManufacturingProcess17 86.84
## BiologicalMaterial06 83.64
## BiologicalMaterial03 78.13
## ManufacturingProcess09 72.37
## BiologicalMaterial12 72.20
## ManufacturingProcess36 70.51
## BiologicalMaterial02 63.10
## ManufacturingProcess06 61.88
## ManufacturingProcess31 58.39
## BiologicalMaterial11 56.85
## ManufacturingProcess33 47.06
## ManufacturingProcess11 45.94
## BiologicalMaterial04 45.43
## ManufacturingProcess29 44.71
## BiologicalMaterial08 44.36
## ManufacturingProcess12 38.22
## BiologicalMaterial01 35.56
## BiologicalMaterial09 33.79
marsGrid <- expand.grid(.degree = 1:3, .nprune = 2:100)
set.seed(100)
chem_mars_tuned <- train(Xtrain.data, Ytrain.data,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv", number = 10))
marspred <- predict(chem_mars_tuned, newdata = xtest.data)
marspv <- postResample(pred = marspred, obs = ytest.data)
marspv
## RMSE Rsquared MAE
## 1.1997927 0.4779621 0.9605993
varImp(chem_mars_tuned)
## earth variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 60.24
## ManufacturingProcess14 0.00
## ManufacturingProcess22 0.00
## BiologicalMaterial02 0.00
## ManufacturingProcess30 0.00
## ManufacturingProcess38 0.00
## ManufacturingProcess29 0.00
## BiologicalMaterial01 0.00
## BiologicalMaterial05 0.00
## ManufacturingProcess40 0.00
## ManufacturingProcess01 0.00
## BiologicalMaterial06 0.00
## BiologicalMaterial10 0.00
## ManufacturingProcess21 0.00
## ManufacturingProcess33 0.00
## ManufacturingProcess31 0.00
## ManufacturingProcess28 0.00
## ManufacturingProcess44 0.00
## ManufacturingProcess12 0.00
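As a quick check on the comparison in the notes below, the three top-10 lists can be intersected (an added sketch; top10 is a hypothetical helper, not from the original write-up):
# hypothetical helper: names of the 10 highest-ranked predictors from a caret fit
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
}
# predictors appearing in all three top-10 lists
Reduce(intersect, list(top10(chem_cubist_tuned),  # tree-based (Cubist)
                       top10(ridgeRegFit),        # linear (ridge)
                       top10(chem_mars_tuned)))   # nonlinear (MARS)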
Notes:
All three models rank the manufacturing process variables as the most important predictors of the product's yield.
ManufacturingProcess32 is common to all the models, appearing in the top three of each.
Only the nonlinear (MARS) model treats just two predictors as important, both manufacturing processes, so importance is much less spread out among the predictors in that model.
Judging by the performance scores, tree-based modeling appears to work best for these data.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
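The plotting code is not in this extract; a minimal sketch using partykit (loaded above), assuming the tuned rpart fit serves as the single tree:
# convert the final rpart fit to a party object; for a regression tree,
# plot() then shows the distribution of yield (boxplots) in each terminal node
plot(partykit::as.party(chem_rpart_tuned$finalModel))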
Yes, a graphical view of the data does provide a better understanding. The tree has several branches that show, at each level, which important predictor drives the split, and the terminal nodes depict how the biological and process predictors relate to the distribution of yield.