Recreate the simulated data from Exercise 7.2:
library(mlbench)        # mlbench.friedman1()
library(randomForest)   # randomForest()
library(caret)          # varImp(), train()
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. Their importance scores are small, indicating that the random forest did not make significant use of the uninformative predictors (V6–V10).
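For instance, the scores for the uninformative predictors can be pulled out directly (a small check using the rfImp1 object created above):
rfImp1[paste0("V", 6:10), , drop = FALSE]   # importance of V6-V10 only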
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
The importance score for V1 is lower, and the scores for the other variables fell as well. Multicollinearity usually affects the ranking of variable importance: when predictors are highly correlated, the importance is split among them, which makes individual scores harder to interpret. V4 is now the most important predictor.
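The second part of the question, adding yet another predictor that is highly correlated with V1, is not shown above. A sketch along the following lines (fit on a temporary copy of the data so the later chunks are unaffected) would be expected to dilute V1's importance even further across the correlated copies:
sim_dup <- simulated
sim_dup$duplicate2 <- sim_dup$V1 + rnorm(200) * .1   # second near-copy of V1
model_dup <- randomForest(y ~ ., data = sim_dup,
                          importance = TRUE, ntree = 1000)
varImp(model_dup, scale = FALSE)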
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)   # cforest() and varimp()
model2 <- cforest(y ~ ., data = simulated)
varimp(model2)
## V1 V2 V3 V4 V5 V6
## 6.14253842 6.10410006 0.10632869 6.54418720 2.07799547 -0.11606490
## V7 V8 V9 V10 duplicate1
## 0.05872350 -0.08585961 -0.08030916 -0.14555774 5.73199392
With conditional = FALSE (the default), the importance scores are very similar to those of the traditional random forest.
varimp(model2, conditional = T)
## V1 V2 V3 V4 V5 V6
## 3.22747234 5.19402511 0.07065979 5.87536300 1.29939244 -0.14118239
## V7 V8 V9 V10 duplicate1
## -0.09302146 -0.30875505 -0.05486725 -0.18846005 2.62605554
Setting conditional = TRUE, the scores are smaller overall, although some of the less important predictors improve a bit.
In both cases the informative predictors (V1–V5) are the most important; they follow the pattern of the traditional random forest.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)      # gbm()
library(Cubist)   # cubist()
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)
# Columns 1:10 are V1-V10, column 11 is y, column 12 is duplicate1
cubist_model <- cubist(simulated[, c(1:10, 12)], simulated$y, committees = 100)
varImp(cubist_model)
Only the Cubist model ranks V2 as the most important predictor. V6, which was uninformative in the previous models, receives a non-trivial importance score from Cubist.
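To compare the patterns directly, the importance rankings of the fits above can be lined up (a sketch using the objects already created; summary() is called with plotit = FALSE so it returns only the relative-influence table):
rf_rank  <- rownames(rfImp2)[order(rfImp2$Overall, decreasing = TRUE)]
gbm_rank <- summary(gbmModel, plotit = FALSE)$var   # already sorted by relative influence
cub_imp  <- varImp(cubist_model)
cub_rank <- rownames(cub_imp)[order(cub_imp$Overall, decreasing = TRUE)]
head(rf_rank, 6); head(gbm_rank, 6); head(cub_rank, 6)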
Use a simulation to show tree bias with different granularities.
library(rpart)   # rpart()

set.seed(100)
# Three predictors with the same mean but increasing standard deviation
x1 <- rnorm(300, 30, 1)
x2 <- rnorm(300, 30, 2)
x3 <- rnorm(300, 30, 3)
set.seed(100)
zy <- (.4 * x1) + (.2 * x2) + (.1 * x3) + rnorm(300, 0, sqrt(1 - (.16 + .04 + .01)))
y <- (1.5 * zy) + 10
simulated2 <- data.frame(x1 = x1, x2 = x2, x3 = x3, y = y)
rpartfit <- rpart(y ~ ., data = simulated2)
varImp(rpartfit)
Regression trees suffer from selection bias: predictors with lower variance are favored over more granular predictors with higher variance. In this simulation, x1 has the smallest standard deviation and comes out as the strongest predictor.
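Granularity bias can also be shown more directly. The following sketch (my own illustration, not part of the original simulation) uses a pure-noise response and two hypothetical predictors, x_fine with many distinct values and x_coarse with only four; counting which variable a single-split rpart tree picks first reveals the preference for the more granular predictor:
set.seed(100)
first_split <- replicate(200, {
  dat <- data.frame(y        = rnorm(100),                       # pure-noise response
                    x_fine   = rnorm(100),                       # ~100 distinct values
                    x_coarse = sample(1:4, 100, replace = TRUE)) # only 4 distinct values
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])                                 # variable used at the root split
})
table(first_split)   # x_fine is typically chosen far more often than x_coarse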
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
Bagging fraction - the fraction of the training set observations randomly sampled to build each tree.
Shrinkage - also known as the learning rate; the larger its value, the bigger the step taken down the gradient at each iteration.
A higher shrinkage value makes the model converge faster, but the larger steps can overshoot the optimum and eventually overfit. With both parameters set to 0.9, each tree sees almost all of the training data and contributes heavily to the fit, so the same few dominant predictors are selected over and over and the importance concentrates on them. With both set to 0.1, the extra sampling randomness and the smaller steps let more predictors enter the ensemble, spreading the importance across a wider set.
Which model do you think would be more predictive of other samples?
The model with both parameters set to 0.1 is likely to be more predictive of other samples: it learns more slowly and spreads the importance over more predictors, and small incremental steps down the gradient generally work best.
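To make the comparison concrete, a sketch along these lines (assuming the solubility data from the AppliedPredictiveModeling package, as used below; solTrain, gbmLow, and gbmHigh are names introduced here for illustration) fits gbm at the two extreme settings and prints each model's relative-influence table:
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbmLow,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbmHigh, plotit = FALSE), 10)   # importance concentrated on a few predictors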
How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?
data(solubility)
gbmGrid1 <- expand.grid(.interaction.depth = 1,
.n.trees = 100,
.shrinkage = 0.1,
.n.minobsinnode=10)
set.seed(100)
gbmTune1 <- train(solTrainXtrans,
solTrainY,
method = "gbm",
tuneGrid = gbmGrid1,
verbose = FALSE)
plot(varImp(gbmTune1), top = 30)
gbmGrid2 <- expand.grid(.interaction.depth = 20,
.n.trees = 100,
.shrinkage = 0.1,
.n.minobsinnode=10)
set.seed(100)
gbmTune2 <- train(solTrainXtrans,
solTrainY,
method = "gbm",
tuneGrid = gbmGrid2,
verbose = FALSE)
plot(varImp(gbmTune2), top = 30)
Comparing the two plots, increasing the interaction.depth parameter spreads the importance across more predictors, flattening the slope of the variable-importance profile.
Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.
Preprocessing
data("ChemicalManufacturingProcess")
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1], method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
yield <- as.matrix(ChemicalManufacturingProcess$Yield)
set.seed(789)
split2 <- yield %>%
createDataPartition(p = 0.8, list = FALSE, times = 1)
Xtrain.data <- Xpreprocess[split2, ] #chem train
xtest.data <- Xpreprocess[-split2, ] #chem test
Ytrain.data <- yield[split2, ] #yield train
ytest.data <- yield[-split2, ] #yield test
set.seed(100)
chem_rpart_tuned <- train(Xtrain.data, Ytrain.data,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
chem_m5_tuned <- train(Xtrain.data, Ytrain.data,
method = "M5",
tuneLength = 10,
control = Weka_control(M = 10))
chem_bagged_tuned <- train(Xtrain.data, Ytrain.data,
method = "treebag",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10))
chem_rf_tuned <- train(Xtrain.data, Ytrain.data,
method = "rf",
tuneLength = 10,
trControl = trainControl(method = "cv", number = 10))
gbmGrid <- expand.grid(.interaction.depth = 1,
.n.trees = 100,
.shrinkage = 0.1,
.n.minobsinnode = 10)
chem_gbm_tuned <- train(Xtrain.data, Ytrain.data,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
chem_cubist_tuned <- train(Xtrain.data, Ytrain.data,
method = "cubist")
library(knitr); library(kableExtra)   # kable() and kable_styling()
RMSE <- c(min(chem_rpart_tuned$results$RMSE), min(chem_m5_tuned$results$RMSE), min(chem_bagged_tuned$results$RMSE),
          min(chem_rf_tuned$results$RMSE), min(chem_gbm_tuned$results$RMSE), min(chem_cubist_tuned$results$RMSE))
Rsquared <- c(max(chem_rpart_tuned$results$Rsquared), max(chem_m5_tuned$results$Rsquared), max(chem_bagged_tuned$results$Rsquared),
              max(chem_rf_tuned$results$Rsquared), max(chem_gbm_tuned$results$Rsquared), max(chem_cubist_tuned$results$Rsquared))
MAE <- c(min(chem_rpart_tuned$results$MAE), min(chem_m5_tuned$results$MAE), min(chem_bagged_tuned$results$MAE),
         min(chem_rf_tuned$results$MAE), min(chem_gbm_tuned$results$MAE), min(chem_cubist_tuned$results$MAE))
results <- cbind(RMSE, Rsquared, MAE) %>% data.frame(row.names = c("RPART", "M5", "BAG", "RF", "GBM", "CUBIST"))
kable(results) %>% kable_styling(bootstrap_options = "striped")
|        |     RMSE |  Rsquared |       MAE |
|:-------|---------:|----------:|----------:|
| RPART  | 1.334645 | 0.5054817 | 1.0618190 |
| M5     | 1.517643 | 0.4258644 | 1.0991966 |
| BAG    | 1.138023 | 0.6642550 | 0.8995001 |
| RF     | 1.140752 | 0.6978560 | 0.8950110 |
| GBM    | 1.248370 | 0.5706191 | 0.9786424 |
| CUBIST | 1.180610 | 0.6165652 | 0.9265535 |
Which tree-based regression model gives the optimal resampling and test set performance?
Based on the table above, the Cubist model performs among the best, with a low resampling RMSE, and it is selected as the optimal tree-based model; its test-set performance can be checked directly (see the sketch below).
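A quick check of the held-out data (using the xtest.data and ytest.data objects created in the splitting step; cubist_test_pred is a name introduced here):
cubist_test_pred <- predict(chem_cubist_tuned, newdata = xtest.data)
postResample(pred = cubist_test_pred, obs = ytest.data)   # test-set RMSE, R-squared, MAE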
chem_cubist_tuned
## Cubist
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.801862 0.3562192 1.3305668
## 1 5 1.786225 0.3629854 1.3000932
## 1 9 1.790586 0.3608555 1.3102473
## 10 0 1.268888 0.5639529 0.9992920
## 10 5 1.246434 0.5793699 0.9715563
## 10 9 1.258836 0.5717701 0.9842097
## 20 0 1.204203 0.6013281 0.9524467
## 20 5 1.180610 0.6165652 0.9265535
## 20 9 1.194745 0.6077350 0.9389085
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 20 and neighbors = 5.
The final values used for the model were committees = 20 and neighbors = 5.
chem_cubist_tuned$bestTune
chem_cubist_tuned$finalModel
##
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
##
## Number of samples: 144
## Number of predictors: 56
##
## Number of committees: 20
## Number of rules per committee: 2, 1, 1, 1, 7, 1, 6, 1, 6, 1, 4, 1, 1, 1, 7, 1, 3, 1, 4, 1
ggplot(chem_cubist_tuned)
plotmo(chem_cubist_tuned)
## plotmo grid: BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## -0.1070431 -0.04306519 -0.1062217
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## -0.07283723 -0.004683137 -0.09620684
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 0.06681 -0.04830923 -0.1178766
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## -0.1012727 -0.03863564 0.1056672
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 0.5096271 0.1087038 0.3424324
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## -0.06529069 -0.2229103 0.4390925
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 0.8941637 0.1066231 -0.1030952
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 0.02020002 -0.4806937 -0.007834829
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 0.04826216 -0.09295527 0.06169755
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 0.005007187 0.06617593 -0.1360039
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 0.0688801 -0.1744786 -0.1218132
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## -0.01031118 -0.1438567 0.0651293
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 0.06432695 0.06918722 0.7255096
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## -0.0066778 0.03954225 0.09273307
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## -0.08632349 0.1836771 0.1182687
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## -0.05513017 0.4884872 -0.03063781
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 0.7174727 0.231727 -0.4626528
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## -0.4405878 0.2027957 -0.1289558
## ManufacturingProcess44 ManufacturingProcess45
## 0.2946725 0.1522024
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
varImp(chem_cubist_tuned)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess17 100.000
## ManufacturingProcess32 94.667
## ManufacturingProcess09 64.000
## BiologicalMaterial03 60.000
## BiologicalMaterial02 50.667
## ManufacturingProcess13 46.667
## BiologicalMaterial12 42.667
## ManufacturingProcess04 42.667
## ManufacturingProcess33 42.667
## ManufacturingProcess39 34.667
## BiologicalMaterial06 34.667
## ManufacturingProcess15 26.667
## ManufacturingProcess29 25.333
## ManufacturingProcess37 20.000
## BiologicalMaterial08 18.667
## ManufacturingProcess02 18.667
## ManufacturingProcess27 14.667
## ManufacturingProcess26 14.667
## ManufacturingProcess31 13.333
## ManufacturingProcess01 9.333
plot(varImp(chem_cubist_tuned))
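From the printed importance list, manufacturing process variables clearly dominate the top of the ranking. A quick tally (a sketch using the fitted Cubist model; cub_imp and top10_cubist are names introduced here) makes the split in the top 10 explicit:
cub_imp <- varImp(chem_cubist_tuned)$importance
top10_cubist <- head(rownames(cub_imp)[order(cub_imp$Overall, decreasing = TRUE)], 10)
table(ifelse(grepl("^Manufacturing", top10_cubist), "Process", "Biological"))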
Comparison
Ex. 6.3 Linear (Ridge)
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 15))
set.seed(101)
ridgeRegFit <- train(Xtrain.data, Ytrain.data, method = "ridge", tuneGrid = ridgeGrid, trControl = trainControl(method = "cv", number = 10))
predictions <- ridgeRegFit %>% predict(xtest.data)
cbind(
RMSE = RMSE(predictions, ytest.data),
R_squared = R2(predictions, ytest.data)
)
## RMSE R_squared
## [1,] 1.06489 0.5657402
varImp(ridgeRegFit)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess13 100.00
## ManufacturingProcess32 97.82
## ManufacturingProcess17 86.84
## BiologicalMaterial06 83.64
## BiologicalMaterial03 78.13
## ManufacturingProcess09 72.37
## BiologicalMaterial12 72.20
## ManufacturingProcess36 70.51
## BiologicalMaterial02 63.10
## ManufacturingProcess06 61.88
## ManufacturingProcess31 58.39
## BiologicalMaterial11 56.85
## ManufacturingProcess33 47.06
## ManufacturingProcess11 45.94
## BiologicalMaterial04 45.43
## ManufacturingProcess29 44.71
## BiologicalMaterial08 44.36
## ManufacturingProcess12 38.22
## BiologicalMaterial01 35.56
## BiologicalMaterial09 33.79
plot(varImp(ridgeRegFit))
Ex. 7.5 Non-Linear (MARS)
marsGrid <- expand.grid(.degree = 1:3, .nprune = 2:100)
set.seed(100)
chem_mars_tuned <- train(Xtrain.data, Ytrain.data,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv", number = 10))
marspred <- predict(chem_mars_tuned, newdata = xtest.data)
marspv <- postResample(pred = marspred, obs = ytest.data)
marspv
## RMSE Rsquared MAE
## 1.1997927 0.4779621 0.9605993
varImp(chem_mars_tuned)
## earth variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 60.24
## ManufacturingProcess14 0.00
## ManufacturingProcess22 0.00
## BiologicalMaterial02 0.00
## ManufacturingProcess30 0.00
## ManufacturingProcess38 0.00
## ManufacturingProcess29 0.00
## BiologicalMaterial01 0.00
## BiologicalMaterial05 0.00
## ManufacturingProcess40 0.00
## ManufacturingProcess01 0.00
## BiologicalMaterial06 0.00
## BiologicalMaterial10 0.00
## ManufacturingProcess21 0.00
## ManufacturingProcess33 0.00
## ManufacturingProcess31 0.00
## ManufacturingProcess28 0.00
## ManufacturingProcess44 0.00
## ManufacturingProcess12 0.00
plot(varImp(chem_mars_tuned))
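To support the notes below, the top-10 predictors from the tuned Cubist, ridge, and MARS models can be placed side by side (a sketch using caret's varImp on the fitted objects above; top10 is a helper defined here for illustration):
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)
}
data.frame(Cubist = top10(chem_cubist_tuned),
           Ridge  = top10(ridgeRegFit),
           MARS   = top10(chem_mars_tuned))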
Notes:
All three models rank manufacturing process variables among their most important predictors.
All three models place ManufacturingProcess32 in their top three.
The non-linear (MARS) model considers only two predictors important, and both are manufacturing process variables.
Based on the performance scores, the tree-based model is the best choice for these data.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rattle)   # fancyRpartPlot()
fancyRpartPlot(chem_rpart_tuned$finalModel, palettes = 'PuRd')
Yes, a graphical view of the data does provide a better picture. The tree shows the split values for each important predictor at each level, and it depicts how the biological and process predictors relate to the yield.
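Beyond the plot itself, the yield distribution within each terminal node can be summarized numerically (a sketch using the final rpart model and the training data from above; finalModel$where records each training sample's terminal node):
node <- chem_rpart_tuned$finalModel$where   # terminal node assignment for each training row
tapply(Ytrain.data, node, summary)          # yield distribution within each terminal node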