Recreate the simulated data from Exercise 7.2
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a) Fit a random forest model to all of the predictors, then estimate the variable importance scores
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model.rf<- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
model.rf
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.754258
## % Var explained: 72.3
Finding feature importance:
rf.imp <- varImp(model.rf, scale = FALSE)
rf.imp
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
The importance scores for V6 through V10 are near zero or negative, evidence that the random forest did not use these predictors.
Using varImpPlot to view the variables in a plot and understand their ranking and importance:
varImpPlot(model.rf)
Clearly, the variables from V6 to V10 do not appear to be important at all.
(b) Now add an additional predictor that is highly correlated with one of the informative predictors.
simulated$dup1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$dup1, simulated$V1)
## [1] 0.9460206
The added predictor dup1 has a correlation of roughly 0.95 with V1.
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model.rf2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
model.rf2
##
## Call:
## randomForest(formula = y ~ ., data = simulated, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 6.922537
## % Var explained: 71.61
rf2.Imp <- varImp(model.rf2, scale = FALSE)
rf2.Imp
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## dup1 4.28331581
varImpPlot(model.rf2)
The importance score for V1 decreased noticeably (from about 8.7 to 5.7) after adding dup1, because the two highly correlated predictors now share the splits that V1 previously captured on its own.
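The prompt also asks what happens when yet another predictor highly correlated with V1 is added. A minimal sketch of that step (the name dup2 and the seed are assumptions, and the exact scores will differ from run to run):
set.seed(300)  # assumed seed; re-running this advances the random stream, so later outputs may differ slightly
simulated$dup2 <- simulated$V1 + rnorm(200) * .1
model.rf3 <- randomForest(y ~ ., data = simulated,
                          importance = TRUE,
                          ntree = 1000)
varImp(model.rf3, scale = FALSE)
# Expectation: V1's importance is diluted further, now shared among V1, dup1 and dup2
simulated$dup2 <- NULL  # drop dup2 so the later models match the output shown below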
(c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(data.table)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
model3 <- cforest(y ~ ., data = simulated)
sort(varimp(model3, conditional = FALSE), decreasing = TRUE)
## V4 V2 dup1 V1 V5
## 7.6223892727 6.0579730772 5.0941897280 4.6171158805 1.7161194047
## V7 V9 V3 V6 V10
## 0.0465374951 0.0046062409 0.0003116115 -0.0289427183 -0.0310326410
## V8
## -0.0380965511
sort(varimp(model3, conditional = TRUE), decreasing = TRUE)
## V4 V2 dup1 V1 V5 V6
## 6.190578351 4.688980360 1.926751729 1.807953145 1.051666850 0.028174759
## V9 V3 V8 V7 V10
## 0.015118505 0.012878752 -0.004356587 -0.011709437 -0.022190587
Setting conditional = TRUE takes noticeably longer to run and produces different scores. The uninformative predictors (V6 through V10) remain at the bottom of the ranking, V3 now also looks essentially uninformative, and the correlated pair V1/dup1 is penalized much more heavily. The overall pattern of informative versus noise predictors still matches the traditional random forest.
(d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
#Boosted Tree
library(gbm)
## Loaded gbm 2.1.8
gbmModel <- gbm(y ~ ., data = simulated, distribution = 'gaussian')
gbmModel
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 11 predictors of which 7 had non-zero influence.
summary(gbmModel)
## var rel.inf
## V4 V4 29.9389549
## V2 V2 21.6012918
## V1 V1 15.6771729
## dup1 dup1 13.2189845
## V5 V5 10.0440211
## V3 V3 9.0948053
## V6 V6 0.4247697
## V7 V7 0.0000000
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
For boosted trees, the same pattern occurs: V4 is still the most important predictor and V6 through V10 are still low.
#Cubist
library(Cubist)
model.cub <- cubist(simulated[,-11], simulated$y)
model.cub
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
## Number of samples: 200
## Number of predictors: 11
##
## Number of committees: 1
## Number of rules: 1
summary(model.cub)
##
## Call:
## cubist.default(x = simulated[, -11], y = simulated$y)
##
##
## Cubist [Release 2.07 GPL Edition] Sun May 02 17:30:19 2021
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (12 attributes) from undefined.data
##
## Model:
##
## Rule 1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.936506]
##
## outcome = 0.269253 + 8.9 V4 + 7.1 V2 + 5.1 V5 + 4.8 V1 + 3.2 dup1
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 2.147257
## Relative |error| 0.53
## Correlation coefficient 0.85
##
##
## Attribute usage:
## Conds Model
##
## 100% V1
## 100% V2
## 100% V4
## 100% V5
## 100% dup1
##
##
## Time: 0.0 secs
Cubist uses only the informative predictors: V1, V2, V4, V5 and dup1 appear in the single rule, while V3 and the noise variables V6 through V10 are not used at all. The same overall pattern occurs.
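For a numeric ranking comparable to the other models, caret's varImp method also works on cubist objects (a quick check; the exact scores depend on the fitted rule):
# Usage-based variable importance for the Cubist model
cub.imp <- varImp(model.cub)
cub.imp[order(-cub.imp$Overall), , drop = FALSE]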
Use a simulation to show tree bias with different granularities
Trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that “The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”
Also, as the number of missing values increases, the selection of predictors becomes more biased (Carolin et al. 2007). It is worth noting that the variable importance scores for the solubility regression tree (Fig. 8.6) show that the model tends to rely more on continuous (i.e., less granular) predictors than the binary fingerprints. This could be due to the selection bias or the content of the variables.
Let’s set up a simulated experiment as follows:
# Defining granular data
set.seed(123)
t1 <- rep(1:2,each=100)
# Defining the relationship between the response and t1
y <- t1 + rnorm(200,mean=0,sd=4)
# Defining non-granular data
set.seed(231)
t2 <- rnorm(200,mean=0,sd=2)
SData <- data.frame(Y=y,X1=t1,X2=t2)
head(SData)
## Y X1 X2
## 1 -1.24190259 1 -1.0662038
## 2 0.07929004 1 -4.6233276
## 3 7.23483326 1 -1.9083957
## 4 1.28203357 1 0.5250315
## 5 1.51715094 1 -0.9467149
## 6 7.86025995 1 0.4347946
boxplot(Y~X1,data=SData, main="Simulated Y values by X1",
xlab="X1", ylab="Y", col=(c("gold","darkgreen")))
plot(Y~X2,data=SData)
Predictor X1 has only two distinct values, but it is defined to create two more homogeneous groups with respect to the response. Predictor X2 takes roughly 200 distinct values (offering many more candidate split points) and is not related to the response.
library(rpart)
selectedPredictors <- data.frame(Predictor=as.character())
for (i in 1:100 ) {
set.seed(i)
X1 <- rep(1:2,each=100)
Y <- X1 + rnorm(200,mean=0,sd=4)
#Y <- rnorm(200,mean=0,sd=2)
set.seed(1000+i)
X2 <- rnorm(200,mean=0,sd=2)
currentSData <- data.frame(Y=Y,X1=X1,X2=X2)
currentRpart <- rpart(Y~X1+X2,data=currentSData,control=rpart.control(maxdepth=1))
currentPredictor <- data.frame(Predictor=rownames(currentRpart$splits)[1])
selectedPredictors <- rbind(selectedPredictors,currentPredictor)
}
treeBiasTable <- table(selectedPredictors$Predictor)
treeBiasTable
##
## X2 X1
## 52 44
library(rpart)
simTree <- rpart(Y~., data = SData)
varImp(simTree)
## Overall
## X1 0.3105884
## X2 1.0494224
Clearly, X2 is identified as the more important variable in the single fitted tree, and over the 100 simulations X1 and X2 are chosen for the top split in roughly equal proportions, even though the response is defined only through X1. As the amount of noise in the simulation increases, the chance that X2 is selected increases; as the noise decreases, that chance decreases. This implies that the large number of distinct values (and hence candidate split points) offered by X2, not any real association with the response, drives its selection.
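To check the claim about noise directly, the selection experiment above can be repeated at several noise levels (a small sketch; the selectionRate helper and the chosen sd values are my additions, and the exact fractions will vary with the seeds):
# Fraction of simulations in which the uninformative X2 wins the top split,
# as a function of the noise standard deviation around the X1 signal
selectionRate <- function(sd.noise, nsim = 100) {
  wins <- 0
  for (i in 1:nsim) {
    set.seed(i)
    X1 <- rep(1:2, each = 100)
    Y  <- X1 + rnorm(200, mean = 0, sd = sd.noise)
    set.seed(1000 + i)
    X2 <- rnorm(200, mean = 0, sd = 2)
    fit <- rpart(Y ~ X1 + X2, data = data.frame(Y, X1, X2),
                 control = rpart.control(maxdepth = 1))
    if (!is.null(fit$splits) && rownames(fit$splits)[1] == "X2") wins <- wins + 1
  }
  wins / nsim
}
sapply(c(1, 2, 4, 8), selectionRate)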
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
With both the learning rate and the bagging fraction set to 0.9, each tree is fit to almost all of the training data and its contribution is added nearly in full, so the model commits to the strongest predictors early and keeps reinforcing them; importance therefore concentrates on the first few predictors.
With both parameters at 0.1, each tree sees only a small random subsample and contributes only a small correction, so more predictors get the opportunity to enter the model and importance is spread across a larger set.
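Figure 8.24 is built on the solubility data, which is not loaded here, but the effect can be sketched on the simulated data from above (a rough illustration only; the seed, n.trees and the smaller n.minobsinnode, needed so a 0.1 bagging fraction still leaves enough in-bag samples, are my choices):
# Boosted models at the two parameter extremes on the simulated data
set.seed(100)
gbm.low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                shrinkage = 0.1, bag.fraction = 0.1,
                n.trees = 100, n.minobsinnode = 5)
set.seed(100)
gbm.high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                shrinkage = 0.9, bag.fraction = 0.9,
                n.trees = 100, n.minobsinnode = 5)
# Relative influence: expect the 0.9/0.9 fit to concentrate on fewer predictors
summary(gbm.low,  plotit = FALSE)
summary(gbm.high, plotit = FALSE)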
(b) Which model do you think would be more predictive of other samples?
The left-hand model (learning rate and bagging fraction of 0.1) is likely to be more predictive of other samples.
Its slower learning and smaller bagging fraction act as regularization and it spreads its fit across more predictors, whereas the right-hand model relies mainly on the top few predictors and is more prone to overfitting.
(c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24.
Increasing the interaction depth allows deeper trees, so more predictors are used in the splits; importance would be spread across more predictors and the slope of the predictor importance plot would flatten for both models.
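As a rough check of this intuition on the same simulated data (again not the book's solubility figure; interaction.depth = 10 is an arbitrary deep setting):
# Same high-shrinkage boosted model as above, but with much deeper trees;
# expect the relative influence to spread across more predictors
set.seed(100)
gbm.deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                shrinkage = 0.9, bag.fraction = 0.9,
                n.trees = 100, n.minobsinnode = 5, interaction.depth = 10)
summary(gbm.deep, plotit = FALSE)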
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
(a) Which tree-based regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
library(missForest)
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess
df_imp1 <- missForest(df)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
set.seed(212)
df_imp <- df_imp1$ximp
data <- df_imp[,2:58]
target <- df_imp[,1]
training <- createDataPartition( target, p=0.75 )
predictor_training <- data[training$Resample1,]
target_training <- target[training$Resample1]
predictor_testing <- data[-training$Resample1,]
target_testing <- target[-training$Resample1]
ctrl <- trainControl(method = "cv", number = 10)
Single Tree Model building and tuning
set.seed(201)
rt_grid <- expand.grid(maxdepth= seq(1,10,by=1))
rt_Tune <- train(x = predictor_training, y = target_training, method = "rpart2", metric = "Rsquared", tuneGrid = rt_grid, trControl = ctrl)
Predict
rt_pred = predict(rt_Tune, predictor_testing)
postResample(pred = rt_pred, obs = target_testing)
## RMSE Rsquared MAE
## 1.3844557 0.4300004 1.0594751
Random Forest Model building and tuning
set.seed(201)
rf_grid <- expand.grid(mtry=seq(2,38,by=3))
rf_Tune <- train(x = predictor_training, y = target_training, method = "rf", tuneGrid = rf_grid, metric = "Rsquared", importance = TRUE, trControl = ctrl)
Predict
rf_pred = predict(rf_Tune, predictor_testing)
postResample(pred = rf_pred, obs = target_testing)
Cubist Model building and tuning
set.seed(201)
cube_grid <- expand.grid(committees = c(1, 5, 10, 20, 50), neighbors = c(0, 1, 3, 5))
cube_Tune <- train(x = predictor_training, y = target_training, method = "cubist", metric = "Rsquared", tuneGrid = cube_grid, trControl = ctrl)
Predict
cube_pred = predict(cube_Tune, predictor_testing)
postResample(pred = cube_pred, obs = target_testing)
## RMSE Rsquared MAE
## 0.7738568 0.8371268 0.6104266
The Cubist model has the best test set performance of all the models tuned above, with the largest R-squared (about 0.84) and the smallest RMSE (about 0.77).
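Resampling performance can also be compared directly across the three tuned models with caret's resamples() (a quick check; since each train() call used the same seed and 10-fold CV, the fold assignments should line up):
# Cross-validation performance of the three tuned models side by side
model.resamples <- resamples(list(SingleTree   = rt_Tune,
                                  RandomForest = rf_Tune,
                                  Cubist       = cube_Tune))
summary(model.resamples)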
(b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
plot(varImp(cube_Tune), top=10, scales = list(y = list(cex = 0.8)))
ManufacturingProcess32 is the most important predictor, and the manufacturing process variables dominate the top-10 list.
The Cubist model once again relies on a small number of top predictors.
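The prompt also asks how the top 10 predictors compare with the optimal linear and nonlinear models from Exercises 6.3 and 7.5. Those fits are not part of this document; assuming hypothetical train objects pls_Tune and svm_Tune were still available, the lists could be lined up as follows:
# Extract the top-10 predictor names from a caret train object
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(-imp$Overall)], 10)
}
top10(cube_Tune)
# Hypothetical comparison (the Exercise 6.3 and 7.5 models are not refit here):
# data.frame(Cubist = top10(cube_Tune),
#            Linear = top10(pls_Tune),
#            Nonlinear = top10(svm_Tune))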
(c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
plot(as.party(rt_Tune$finalModel),gp=gpar(fontsize=11))
The top splits in the tree are on manufacturing process variables.
This confirms that the process predictors play a much larger role in determining yield than the biological measurements.