library(tidyverse)
library(partykit)
library(party)
library(rpart)
library(mlbench)
library(AppliedPredictiveModeling)
library(randomForest)
library(caret)
library(gbm)
library(Cubist)
library(rpart.plot)Recreate the simulated data from Exercise 7.2:
set.seed(222)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
kableExtra::kable(rfImp1)## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
| Overall | |
|---|---|
| V1 | 6.2731059 |
| V2 | 7.5697862 |
| V3 | 0.7202476 |
| V4 | 12.2518902 |
| V5 | 1.2145862 |
| V6 | -0.0155427 |
| V7 | -0.1992948 |
| V8 | -0.0606985 |
| V9 | -0.0264306 |
| V10 | -0.0584667 |
Question: Did the random forest model significantly use the uninformative predictors (V6 – V10)?
Answer: No, the random forest model does not significantly use the uninformative predictors V6 to V10.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
set.seed(111)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)## [1] 0.9425556
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
When another highly correlated predictor is added, the important scores for the other variables increase while the importance score for V1 decreased yet more.
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
kableExtra::kable(rfImp2)| Overall | |
|---|---|
| V1 | 3.8090525 |
| V2 | 7.1975556 |
| V3 | 0.5529476 |
| V4 | 11.9729937 |
| V5 | 1.0865307 |
| V6 | -0.0290631 |
| V7 | -0.1067976 |
| V8 | -0.0684317 |
| V9 | -0.1140780 |
| V10 | -0.0062604 |
| duplicate1 | 4.0596497 |
set.seed(111)
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)## [1] 0.9425556
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
kableExtra::kable(rfImp3)| Overall | |
|---|---|
| V1 | 2.7254180 |
| V2 | 7.7360969 |
| V3 | 0.4062591 |
| V4 | 12.4416632 |
| V5 | 1.2799369 |
| V6 | 0.0046985 |
| V7 | -0.0481235 |
| V8 | -0.0595274 |
| V9 | -0.0940122 |
| V10 | -0.0356415 |
| duplicate1 | 2.6684864 |
| duplicate2 | 3.1529421 |
The importance score for V1 was reduced with the addition of duplicated1. The addition of another correlated variable to V1 further reduces the importance.
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varImp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y~., data = simulated)varImp(model4, conditional = FALSE) ## Overall
## V1 3.47903152
## V2 6.72312059
## V3 0.06836182
## V4 12.45489178
## V5 0.61216864
## V6 -0.02008054
## V7 -0.02639576
## V8 -0.01154652
## V9 -0.10373868
## V10 -0.05339805
## duplicate1 3.18612144
## duplicate2 1.44229700
varImp(model4, conditional = TRUE)## Overall
## V1 1.782259041
## V2 6.340471627
## V3 0.017436678
## V4 11.002228344
## V5 0.485233570
## V6 0.006385456
## V7 -0.036130064
## V8 -0.042626261
## V9 -0.039970254
## V10 -0.001984277
## duplicate1 1.210519592
## duplicate2 0.409864153
The variable importances differ between the conditional and nonconditional parameter. It should be noted that the uninformative features, V6-V10, contribute very little if at all to the model outcome.
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
The pattern of the uniformative features continued as most received and importance factor of about 0. A notable occurrence is that duplicate2 did not receive any weighting in variable importance.
set.seed(111)
model5 <- gbm(y~., data = simulated, distribution = "gaussian")
summary.gbm(model5)## var rel.inf
## V4 V4 42.4706128
## V2 V2 23.1043497
## duplicate1 duplicate1 16.4936039
## V3 V3 6.3746279
## V1 V1 5.7895306
## V5 V5 5.0562346
## V9 V9 0.2987317
## V10 V10 0.2629378
## V7 V7 0.1493710
## V6 V6 0.0000000
## V8 V8 0.0000000
## duplicate2 duplicate2 0.0000000
The cubist model had similar variable importance ranking with the gbm model. The less correlated features were assigned little importance. The variable duplicate2 was also given a value of 0.
model6 <- cubist(x = simulated[,-11], y = simulated$y, committees = 100)
varImp(model6)## Overall
## V2 63.0
## V3 45.5
## V1 55.5
## V4 35.5
## duplicate1 8.0
## V7 4.5
## V5 29.0
## V6 1.0
## V8 1.0
## V9 0.0
## V10 0.0
## duplicate2 0.0
Use a simulation to show tree bias with different granularities.
In the case there is a variable that has higher number of unique values, the tree will select that variable over others. There is a higher chance the model will choose the noise variables over the informative variables in the top nodes.
For this simulation, the tree-based model selected the variables that have more distinct values as more important. It also selected the noisy or the variables with the most repetitive values as the top node.
set.seed(917)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))varImp(rpartTree)## Overall
## a 2.122025
## b 3.780325
## c 3.273619
## d 2.981395
## e 3.514504
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right focuses its importance on just the first few of predictors and the model on the left spreads importance across more predictors is because to how the tuning parameters, bagging fraction and learning rate are assigned. A lower bagging fraction implies that less data is to be selected for training the subsequent tree whereas a higher bagging fraction would do the opposite. The learning rate affects the model fitting process by taking incremental steps toward a local minimum. The model on the left spreads the importance over more variables because it had a slower learning rate and used less data for subsequent trees which allowed the extraction of more information from other variables. The model on the right had a faster learning rate and used more data, forcing the convergence of the model to focus on a handful of predictors.
I would think that the model on the left to be more predictive of other samples as it is less likely to have overfit to the training set.
An increase in the interaction depth would provide the tree with more iterations to learn from other predictors, lowering the slope of the predictor importance as more predictors will hold importance.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and Spre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
imputed.knn <- preProcess(ChemicalManufacturingProcess,
method = "knnImpute",
k = sqrt(nrow(ChemicalManufacturingProcess))
)
imputed.data <- predict(imputed.knn, ChemicalManufacturingProcess)
near_zero <- nearZeroVar(imputed.data)
imputed.data <- imputed.data[, -near_zero]
set.seed(917)
train_test_split <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.7, list = FALSE)
train.data <- imputed.data[train_test_split,]
test.data <- imputed.data[-train_test_split,]set.seed(917)
single <- train(Yield ~ .,
data = train.data,
method = "rpart",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
pred1 <- predict(single, newdata = test.data)
rpart.metrics <- postResample(pred = pred1, obs = test.data$Yield)
rpart.metrics## RMSE Rsquared MAE
## 0.7943422 0.3450245 0.5800113
set.seed(917)
bagged <- train(Yield ~ .,
data = train.data,
method = "treebag",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
pred2 <- predict(bagged, newdata = test.data)
bagged.metrics <- postResample(pred = pred2, obs = test.data$Yield)
bagged.metrics## RMSE Rsquared MAE
## 0.5557336 0.6426890 0.4099427
set.seed(917)
gb <- gbm(Yield~., data = train.data, distribution = "gaussian")
pred3 <- predict(gb, newdata = test.data)## Using 100 trees...
boosted.metrics <- postResample(pred = pred3, obs = test.data$Yield)
boosted.metrics## RMSE Rsquared MAE
## 0.5093102 0.7036780 0.3871666
set.seed(917)
rf <- train(Yield ~ .,
data = train.data,
method = "rf",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
pred4 <- predict(rf, newdata = test.data)
rf.metrics <- postResample(pred = pred4, obs = test.data$Yield)
rf.metrics## RMSE Rsquared MAE
## 0.4879136 0.7743460 0.3955877
cubist.model <- train(Yield ~ .,
data = train.data,
method = "cubist",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))
pred5 <- predict(cubist.model, newdata = test.data)
cubist.metrics <- postResample(pred = pred5, obs = test.data$Yield)
cubist.metrics## RMSE Rsquared MAE
## 0.4064884 0.8098096 0.3207813
The best model across all metrics on the testing set data is the cubist model.
kableExtra::kable(rbind(rpart.metrics,
bagged.metrics,
boosted.metrics,
rf.metrics,
cubist.metrics))| RMSE | Rsquared | MAE | |
|---|---|---|---|
| rpart.metrics | 0.7943422 | 0.3450245 | 0.5800113 |
| bagged.metrics | 0.5557336 | 0.6426890 | 0.4099427 |
| boosted.metrics | 0.5093102 | 0.7036780 | 0.3871666 |
| rf.metrics | 0.4879136 | 0.7743460 | 0.3955877 |
| cubist.metrics | 0.4064884 | 0.8098096 | 0.3207813 |
varImp(cubist.model) ## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 62.79
## ManufacturingProcess13 62.79
## BiologicalMaterial02 45.35
## ManufacturingProcess37 41.86
## BiologicalMaterial03 38.37
## ManufacturingProcess33 31.40
## BiologicalMaterial10 29.07
## ManufacturingProcess39 29.07
## ManufacturingProcess17 27.91
## ManufacturingProcess04 26.74
## ManufacturingProcess09 18.60
## BiologicalMaterial12 17.44
## ManufacturingProcess15 16.28
## ManufacturingProcess02 16.28
## BiologicalMaterial04 15.12
## BiologicalMaterial09 12.79
## ManufacturingProcess06 11.63
## ManufacturingProcess30 11.63
## ManufacturingProcess43 10.47
The single tree plot provides useful information about the data. The plot shows that the most decisive split occurs with the root node ManufacturingProcess32. The model shows that lower yeild is associated with BiologicalMaterial12, ManufacturingProcess18, BiologicalMaterial04, and BiologicalMaterial11. Higher Yields are associated with ManufacturingProcess31, BiologicalMater05, and ManufacturingProcess17.
rpart.model <- rpart(Yield~., data = train.data)
rpart.plot(rpart.model)