library(AppliedPredictiveModeling)
library(tidyverse)
library(mlbench)
library(randomForest)
library(caret)
library(party)
library(kernlab)
library(gbm)
library(Cubist)
library(rpart)
library(ranger)
library(partykit)
Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree =1000)
rfimp1 <- varImp(model1, scale = FALSE)
rfimp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 +rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
set.seed(334)
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.27087511
## V2 6.07898644
## V3 0.63113644
## V4 7.12083731
## V5 1.84583776
## V6 0.06925368
## V7 0.04242028
## V8 -0.07037316
## V9 0.05503098
## V10 0.01740119
## duplicate1 4.47909114
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.93512
We can see a drop from every predictor importance value to balance a little bit but interestingly enough we also can see that the duplicate of V1 is the 4th most important predictor value at this time.
Use the cforest function in the party package to fit a
random forest model using conditional inference trees. The party package
function varimp can calculate predictor importance. The conditional
argument of that function toggles between the traditional importance
measure and the modified version described in Strobl et al. (2007). Do
these importances show the same pattern as the traditional random forest
model?
model3 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp3 <- varimp(model3, conditional = TRUE)
rfImp3
## V1 V2 V3 V4 V5 V6 V7
## 6.2077667 5.2073432 0.1852956 6.1055620 1.3268631 -0.1120681 -0.1247807
## V8 V9 V10
## -0.3175298 -0.3357481 -0.2707247
Looking at the importance we can see the same pattern as the traditional random forest as V1, V2,V4 are all the strongest predictors for these models as well as V6-V10 are all shown to be a weaker importance value and would probably be cutt out of any parameter tuning.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(334)
gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
rfImp4 <- varImp(gbmTune$finalModel, scale = FALSE)
rfImp4
## Overall
## V1 4387.9940
## V2 3762.0715
## V3 1347.2951
## V4 4550.8658
## V5 2148.1090
## V6 389.8535
## V7 404.6429
## V8 279.8491
## V9 273.3015
## V10 324.8737
set.seed(334)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp5 <- varImp(cubistTuned$finalModel, scale = FALSE)
rfImp5
## Overall
## V1 72.0
## V3 42.0
## V2 54.5
## V4 49.0
## V5 40.0
## V6 11.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
The Same Pattern seems to be reoccuring in Both Boosted Trees and Cubist with V1,V3 and V4 being the strongest predictors based on importance.
Use a simulation to show tree bias with different granularities.
set.seed(21)
# Data with varying granularities
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a, b, c, d, e, y)
# Fit Decision tree model
rpartTree <- rpart(y ~ ., data = simData)
tree_party <- as.party(rpartTree)
plot(tree_party, gp = gpar(fontsize = 7))
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The Model on the right seems to focus on its important on only a few predictors due to overfitting. This can be seen with it’s higher learning rate and bagging fraction. The learning rate is the percentage of the current predicted value added to the previous iteration’s prediction, indicating the prediction speed. As the bagging fraction represents the portion of the training data that was utilized with the model itself. Since the model on the right has a higher learning rate, this would translate to increasing the weight of each predictor. Compared to the model on the left which seems to have a smaller learning rate and bagging fraction. As the model is then learning slower and performing better it also uses a smaller subset of data so the weights of each predictor isn’t going to be as high singularity but more distributed between all of them.
Which model do you think would be more predictive of other samples?
The model on the left would be better at predicting other samples since it goes though more iterations. This means that with an increase amount of iteration, the weight of each individual is decreased. Overall this will cause the model on the left generalizes better which in term means that it is more accurate.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Firstly, interaction depth is the number of splits that is performed on a tree or the maximum number of nodes per each tree. So when the interaction depth increase, the importance of each predictor also increase. This then results in the smaller important predictors to be evenly weighted. The slope of the predictor importance for either model will then increase or shows the slope getting steeper.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data(ChemicalManufacturingProcess)
# Using Knn imputation
impute <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
imputed <- predict(impute, ChemicalManufacturingProcess)
imputed <- imputed %>% select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
set.seed(22)
X <- dplyr::select(imputed, -Yield)
y <- imputed$Yield
index <- createDataPartition(y, p = .8, list = FALSE)
train_X <- X[index, ] %>% as.matrix()
test_X <- X[-index, ] %>% as.matrix()
train_y <- y[index]
test_y <- y[-index]
Which tree-based regression model gives the optimal resampling and test set performance?
It seems that GBM Model has the best performance with a resulting RMSE of 0.4912689 and the highest R-Squared value of 0.6838583. The 2nd best performing model would be Random Forest and Bagged Model performing the worst out of the 3.
set.seed(17)
bag_setup = bagControl(fit = ctreeBag$fit, predict = ctreeBag$pred, aggregate = ctreeBag$aggregate)
bag_model_fit <- train(x = train_X, y = train_y, method="bag", bagControl = bag_setup,
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),importance = "permutation",
tuneLength = 25)
bag_pred <- predict(bag_model_fit, test_X)
postResample(bag_pred, test_y)
## RMSE Rsquared MAE
## 0.6213009 0.4955786 0.4969667
set.seed(17)
rf_model_fit <- train(x = train_X, y = train_y, method = "ranger",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
importance = "permutation",
tuneLength = 25)
rf_pred <- predict(rf_model_fit, test_X)
postResample(rf_pred, test_y)
## RMSE Rsquared MAE
## 0.5387580 0.6069437 0.4343783
grid_params <- expand.grid(n.trees=c(50, 100),
interaction.depth=c(1, 5, 10),
shrinkage=c(0.01, 0.1, 0.2),
n.minobsinnode=c(5, 10))
gmb_model_fit<- train(x = train_X, y = train_y,
method = 'gbm',
tuneGrid = grid_params,
verbose = FALSE)
gmb_pred <- predict(gmb_model_fit, test_X)
postResample(gmb_pred, test_y)
## RMSE Rsquared MAE
## 0.4912689 0.6838583 0.4018570
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(gmb_model_fit),
top=10,
main="Top 10 Important Features with GBM Model")
plot(varImp(rf_model_fit),
top=10,
main="Top 10 Important Features with Random Forest Model")
plot(varImp(bag_model_fit),
top=10,
main="Top 10 Important Features with Bagging Model")
Interestingly enough, we can see that GBM which was the model
performed the most has the highest ratio of Manufacturing Process
Predictors to Biological Materials at a ratio of 7:3. Random Forest has
an even split of 5:5 predictors and Bagging actually has the same. Since
the Optimal Linear and Non-Linear Models previously had a higher amount
of Manufacturing Process predictors in their top 10 we can see a trend
in that relying on Manufacturing Process Predictors will result in a
better model. It is also should be noted that in all the models done
within this data set Manufacturing Process 32 remains the
top predictor.
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
model <- train(x = train_X, y = train_y, method = "rpart",
trControl = trainControl("cv", number = 10),
tuneLength = 10)
finalRpartModel <- model$finalModel
tree_party <- as.party(finalRpartModel)
plot(tree_party, main = "Optimal Single Tree Plot", gp = gpar(fontsize = 7))
This View does help demonstrate the separation of when Manufacturing
Process continues to be the driving factor after MP32. Since it seems
when the value is on either side of 0.192, it goes toward Biological
Material 11 if its under 0.192 and goes to Manufacturing Process 06 if
its above 0.192. It is also interesting to note that when it goes
through 3 Manufacturing Process Predictor it seems to yield the greatest
results.