Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores.
Random forests are more computationally efficient than boosting on a tree-by-tree basis, since the tree-building process only needs to evaluate a random fraction of the original predictors at each split, although random forests usually require more trees. Combining this attribute with the fact that the trees are independent of one another, and can therefore be built in parallel, makes random forests more computationally efficient overall.
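To illustrate the parallelism point, here is a minimal sketch (my own addition, not part of the exercise) that grows four sub-forests on separate workers using the foreach and doParallel packages and merges them with randomForest::combine:
library(foreach)
library(doParallel)
cl <- makeCluster(4)  # hypothetical choice: four worker processes
registerDoParallel(cl)
rf.par <- foreach(ntree = rep(250, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
  randomForest(y ~ ., data = simulated, ntree = ntree)
stopCluster(cl)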
Let’s fit a random forest to the simulated data and compute the variable (feature) importance scores.
library(randomForest)
library(caret)
model.rf<- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
model.rf
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.754258
## % Var explained: 72.3
Now we compute the feature importance. The table below lists every variable with its overall importance score.
rf.imp <- varImp(model.rf, scale = FALSE)
rf.imp
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Using varImpPlot, we can plot the variables and inspect their ranking by importance.
varImpPlot(model.rf)
We can see that variables V6 through V10 do not appear to be important at all.
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
simulated$dup1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$dup1, simulated$V1)
## [1] 0.9460206
We can see that dup1 has a correlation of about 0.95 with V1.
model.rf2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
model.rf2
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.922537
## % Var explained: 71.61
rf2.Imp <- varImp(model.rf2, scale = FALSE)
rf2.Imp
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## dup1 4.28331581
varImpPlot(model.rf2)
We can see that the importance of V1 decreased significantly once the highly correlated duplicate predictor was added: the two near-duplicates now share the splits, and hence the importance, between them.
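As a further, hypothetical check (using a copy of the data so the later steps are unaffected), adding a second noisy copy of V1 should dilute its importance even more, since the splits would then be shared among three near-duplicates:
sim.dup2 <- simulated
sim.dup2$dup2 <- sim.dup2$V1 + rnorm(200) * .1  # second noisy copy of V1
model.rf3 <- randomForest(y ~ ., data = sim.dup2, importance = TRUE, ntree = 1000)
varImp(model.rf3, scale = FALSE)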
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(data.table)
library(party)
simulated <- subset(simulated, select = -dup1)
model3 <- cforest(y ~ ., data = simulated)
feature.imp <- varimp(model3)
feature.imp
## V1 V2 V3 V4 V5
## 9.0625534937 6.7499292185 0.0207328511 8.2653505901 1.8439761772
## V6 V7 V8 V9 V10
## 0.0005993854 0.0344981972 -0.0318008565 0.0046931439 -0.0231819864
It appears that cforest also ranks V1-V5 as the important variables, with V6-V10 as low-ranked features, matching the pattern from the traditional random forest.
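The conditional measure of Strobl et al. (2007) can be requested as sketched below (note that the conditional computation can be slow on larger forests):
feature.imp.cond <- varimp(model3, conditional = TRUE)
feature.imp.cond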
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(Cubist)
## Warning: package 'Cubist' was built under R version 3.4.4
model.cub <- cubist(simulated[,-11], simulated$y)
model.cub
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
## Number of samples: 200
## Number of predictors: 10
##
## Number of committees: 1
## Number of rules: 1
summary(model.cub)
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Sun Apr 28 22:01:09 2019
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (11 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.944664]
##
## outcome = 0.183529 + 8.9 V4 + 7.9 V1 + 7.1 V2 + 5.3 V5
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.224012
## Relative |error| 0.55
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
##
##
## Time: 0.0 secs
It appears that Cubist uses only the highly important variables: V1, V2, V4, and V5 are the only predictors that enter the Cubist model.
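The exercise also asks about boosted trees. A minimal sketch with the gbm package (assuming it is installed; the parameter values here are illustrative, not tuned):
library(gbm)
model.gbm <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3, shrinkage = 0.1)
summary(model.gbm, plotit = FALSE)  # relative influence of each predictor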
Use a simulation to show tree bias with different granularities.
We can use rpart to fit a single regression tree and the rpart.plot package to visualize it. The simulated data mix predictors of different granularities: the two TwoFactor variables are rounded and converted to factors (few distinct values), while the remaining predictors stay continuous (many distinct values).
library(tidyverse)
library(rpart)
library(rpart.plot)
data.set <- twoClassSim(300, noiseVars = 8, corrVar = 6, corrValue = 0.8) %>%
mutate(TwoFactor1 = as.factor(round(TwoFactor1, 0)),
TwoFactor2 = as.factor(round(TwoFactor2, 0)))
r.fit <- rpart(Linear01 ~ ., data=data.set)
rpart.plot(r.fit)
varImp(r.fit)
## Overall
## Corr1 0.78249579
## Corr2 0.43282264
## Corr3 0.46687451
## Corr4 0.90525169
## Corr5 0.37270428
## Corr6 0.72610752
## Linear02 0.83759541
## Linear03 0.45900186
## Linear04 0.95666938
## Linear05 1.25476901
## Linear06 0.76111067
## Linear07 0.19864176
## Linear08 0.55827948
## Linear09 0.14981697
## Linear10 0.59264207
## Noise2 0.31739566
## Noise3 0.56824421
## Noise4 1.48251253
## Noise5 0.05101641
## Noise6 0.28014828
## Noise7 0.28239327
## Noise8 0.57478978
## Nonlinear2 0.70332393
## Nonlinear3 0.51860792
## TwoFactor1 0.82678816
## TwoFactor2 0.30394295
## Nonlinear1 0.00000000
## Noise1 0.00000000
## Class 0.00000000
We can see overall scores for the linear, nonlinear, noise, and correlated predictors. Notably, continuous noise variables (e.g., Noise4 at 1.48) can outrank genuinely informative predictors, illustrating that trees are biased toward predictors with higher granularity, i.e., more distinct values and therefore more candidate split points.
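A more direct sketch of this granularity bias (a hypothetical setup: neither predictor is related to the response, but x.fine offers many more candidate split points than x.coarse):
set.seed(100)
picks <- replicate(200, {
  d <- data.frame(y = rnorm(100),
                  x.fine = rnorm(100),                # ~100 distinct values
                  x.coarse = sample(1:2, 100, TRUE))  # 2 distinct values
  fit <- rpart(y ~ ., data = d, maxdepth = 1, cp = 0, minsplit = 2)
  as.character(fit$frame$var[1])  # variable chosen for the root split
})
table(picks)  # x.fine should be chosen far more often, despite carrying no signal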
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right (both parameters set to 0.9) concentrates its importance on just the first few predictors: with a higher learning rate, each iteration leans more heavily on the same dominant predictors, so the number of predictors the model focuses on decreases. Likewise, the higher the bagging fraction, the fewer predictors are identified as important, since every tree is fit to nearly the same data.
Which model do you think would be more predictive of other samples? The left-hand model (bagging fraction and learning rate of 0.1) would likely be more predictive: it spreads importance across more predictors and learns more slowly, making it less prone to overfitting a handful of dominant predictors.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? Deeper trees allow more predictors to take part in splits, so importance would be spread across more predictors and the slope of the importance curve would flatten for either model.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
- Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
library(missForest)
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
df_imp1 <- missForest(df)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
df_imp <- df_imp1$ximp
data <- df_imp[,2:58]
target <- df_imp[,1]
training <- createDataPartition( target, p=0.75 )
predictor_training <- data[training$Resample1,]
target_training <- target[training$Resample1]
predictor_testing <- data[-training$Resample1,]
target_testing <- target[-training$Resample1]
ctrl <- trainControl(method = "cv", number = 10)
Single Tree Model building and tuning
rt_grid <- expand.grid(maxdepth= seq(1,10,by=1))
rt_Tune <- train(x = predictor_training, y = target_training, method = "rpart2", metric = "Rsquared", tuneGrid = rt_grid, trControl = ctrl)
Predict on the test set:
rt_pred = predict(rt_Tune, predictor_testing)
postResample(pred = rt_pred, obs = target_testing)
## RMSE Rsquared MAE
## 1.2487359 0.5731629 0.9551321
Random Forest Model building and tuning
rf_grid <- expand.grid(mtry=seq(2,38,by=3))
rf_Tune <- train(x = predictor_training, y = target_training, method = "rf", tuneGrid = rf_grid, metric = "Rsquared", importance = TRUE, trControl = ctrl)
Predict on the test set:
rf_pred = predict(rf_Tune, predictor_testing)
postResample(pred = rf_pred, obs = target_testing)
Cubist Model building and tuning
cube_grid <- expand.grid(committees = c(1, 5, 10, 20, 50), neighbors = c(0, 1, 3, 5))
cube_Tune <- train(x = predictor_training, y = target_training, method = "cubist", metric = "Rsquared", tuneGrid = cube_grid, trControl = ctrl)
Predict on the test set:
cube_pred = predict(cube_Tune, predictor_testing)
postResample(pred = cube_pred, obs = target_testing)
## RMSE Rsquared MAE
## 1.2013718 0.6230656 0.8570788
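To compare resampling (cross-validation) performance directly across the three tuned models, caret's resamples function can be used; a minimal sketch (the list names are my own labels):
res <- resamples(list(SingleTree = rt_Tune, RandomForest = rf_Tune, Cubist = cube_Tune))
summary(res)  # distributions of RMSE, Rsquared and MAE across the CV folds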
The Cubist model gives the best test set RMSE. One of the main reasons could be that, as seen earlier, it builds its rules from only the top predictors.
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
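One way to list the top 10 predictors for that comparison, complementing the plot below (a sketch; varImp on the train object exposes an importance table):
cube.imp <- varImp(cube_Tune)$importance
head(cube.imp[order(-cube.imp$Overall), , drop = FALSE], 10)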
plot(varImp(cube_Tune), top=10, scales = list(y = list(cex = 0.8)))
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
plot(as.party(rt_Tune$finalModel),gp=gpar(fontsize=11))