Recreate the simulated data from Exercise 7.2:
library(mlbench)
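The rest of this write-up assumes the following packages are attached as well (the attach order only matters so that party's cforest and varimp mask the partykit versions):
library(randomForest)               # randomForest()
library(caret)                      # train(), varImp(), preProcess(), createDataPartition(), nearZeroVar()
library(partykit)                   # as.party() for plotting rpart trees
library(party)                      # cforest() and varimp() with a conditional argument
library(gbm)                        # backend for method = "gbm"
library(Cubist)                     # backend for method = "cubist"
library(rpart)                      # rpart()
library(ipred)                      # ipredbagg() and the bagImpute method
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data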
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest model did not make meaningful use of the uninformative predictors: the importance scores for V6–V10 are all essentially zero, and several are slightly negative.
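To make the cutoff concrete, the permutation importances can be sorted so that the V6–V10 block is easy to compare against V1–V5 (a small sketch reusing the rfImp1 object above):
# sort the importances; V6-V10 sit at or just below zero, far from V1-V5
rfImp1[order(rfImp1$Overall, decreasing = TRUE), , drop = FALSE]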
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change?
The importance score for V1 dropped (from about 8.7 to 5.7), so V2 and V4 now rank above it; much of V1's importance is absorbed by the nearly identical duplicate1 predictor. A side-by-side comparison appears after the output below.
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
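A side-by-side view shows how V1's importance is split with its duplicate (a small sketch that merges the two varImp data frames on their row names):
# compare importances before and after adding duplicate1
imps <- merge(rfImp1, rfImp2, by = "row.names", all = TRUE,
              suffixes = c(".orig", ".withDup"))
imps[order(imps$Overall.withDup, decreasing = TRUE), ]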
What happens when you add another predictor that is also highly correlated with V1?
When another predictor that is highly correlated with V1 is added, V1's importance drops even further, and duplicate1's score also falls, because the importance is now diluted across three nearly collinear predictors; the scores for the other informative predictors (V2–V5) are largely unchanged.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
## V1 V2 V3 V4 V5 V6
## 6.35767602 5.33898936 0.14934113 6.13948255 1.38546210 -0.22927142
## V7 V8 V9 V10
## -0.12173581 -0.18135897 -0.02920453 -0.20105653
The conditional importances show essentially the same pattern as the traditional random forest model: V1 is still the most important predictor, V4 remains second, V2 third, and the uninformative predictors (V6–V10) stay at or near zero. The scores themselves are somewhat smaller than the traditional permutation importances, but the ranking of the informative predictors is preserved.
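For reference, setting conditional = FALSE on the same fit returns the traditional (unconditional) cforest importances, so the two measures can be compared directly:
# traditional permutation importance from the same cforest fit
varimp(model4, conditional = FALSE)
The boosted tree (gbm) and Cubist models below repeat the same importance comparison.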
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(y ~ ., data = simulated[, c(1:11)],
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
rfImp5 <- varImp(gbmTune$finalModel, scale = FALSE)
rfImp5
## Overall
## V1 4634.0234
## V2 4316.6044
## V3 1310.7178
## V4 4287.0793
## V5 1844.2147
## V6 416.3966
## V7 368.9383
## V8 224.4769
## V9 246.1400
## V10 250.6263
The boosted model shows a similar pattern: the informative predictors (V1–V5) receive almost all of the importance, while V6–V10 contribute little. One difference is that V2 edges out V4 as the second most important predictor, unlike the traditional random forest model.
set.seed(100)
cubistTuned <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubistTuned$finalModel, scale = FALSE)
rfImp6
## Overall
## V1 72.0
## V3 42.0
## V2 54.5
## V4 49.0
## V5 40.0
## V6 11.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
The Cubist model follows the same general pattern as well: V1–V5 receive essentially all of the importance, while V7–V10 are never used in the rules.
Use a simulation to show tree bias with different granularities.
Regression trees are biased toward predictors with more distinct values (finer granularity): when predictors are equally related to the response, the tree tends to split on, and assign more importance to, the more granular ones, and a noisy but finely grained predictor can even be chosen near the top of the tree over a more informative but coarse one.
In the simulation below, all five predictors contribute equally to y but differ only in granularity (a takes 10 possible values, e takes 100,000). The fitted tree gives the coarsest predictor, a, the lowest importance and favors the finer-grained predictors for the top splits.
set.seed(624)
a <- sample(1:10 / 10, 500, replace = TRUE)
b <- sample(1:100 / 100, 500, replace = TRUE)
c <- sample(1:1000 / 1000, 500, replace = TRUE)
d <- sample(1:10000 / 10000, 500, replace = TRUE)
e <- sample(1:100000 / 100000, 500, replace = TRUE)
y <- a + b + c + d + e
simData <- data.frame(a,b,c,d,e,y)
rpartTree <- rpart(y ~ ., data = simData)
plot(as.party(rpartTree), gp = gpar(fontsize = 7))
varImp(rpartTree)
## Overall
## a 1.702702
## b 4.769899
## c 3.551946
## d 3.205420
## e 3.957698
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The bagging fraction is the proportion of the training data randomly sampled to build each tree, while the learning rate (shrinkage) is the fraction of each tree's prediction that is added to the running model, i.e. how quickly the model learns. The right-hand model (both set to 0.9) learns quickly: each tree makes a large contribution, so the model commits early to the strongest predictors, and because each tree sees nearly all of the training data, the trees are similar and keep reusing those same predictors. Its importance is therefore concentrated in the first few predictors, and it is more prone to overfitting. The left-hand model (both set to 0.1) learns slowly with many small, varied contributions, at the cost of more iterations and computation, so its importance is spread across many more predictors.
The model on the left would likely be more predictive of other samples: the small learning rate and bagging fraction make overfitting less likely and spread the contribution across more predictors, so it should generalize better.
Interaction depth is the maximum depth of each tree, i.e. the number of splits it can make. Increasing it allows each tree to use more predictors, so the less dominant predictors contribute more and importance is spread more evenly; the slope of the predictor importance plot would therefore flatten rather than steepen.
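A hedged sketch of how the two extreme fits in Fig. 8.24 could be reproduced, assuming the solubility objects (solTrainXtrans, solTrainY) from the AppliedPredictiveModeling package; the shrinkage and bag.fraction values mirror the figure, while the tree count and depth are arbitrary choices for the sketch:
data(solubility)
gbmLow  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 7,
                   shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbmHigh <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 7,
                   shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
# relative influence: spread across many predictors for gbmLow,
# concentrated in the first few predictors for gbmHigh
head(summary(gbmLow, plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)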
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
set.seed(100)
data(ChemicalManufacturingProcess)
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(624)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
set.seed(100)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 1.5632389 0.5035625 1.1913334
set.seed(100)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 1.2311358 0.7141633 0.8911824
set.seed(100)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 1.2598015 0.7216121 0.8998200
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 1.1824945 0.7195344 0.8678498
set.seed(100)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.9804618 0.7964499 0.7363520
The Cubist model gives the lowest RMSE and the highest R² on the test set, making it the best-performing of the tree-based models:
rbind(cart = postResample(cartPred, test_y),
bagged = postResample(baggedPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## cart 1.5632389 0.5035625 1.1913334
## bagged 1.2311358 0.7141633 0.8911824
## randomForest 1.2598015 0.7216121 0.8998200
## boosted 1.1824945 0.7195344 0.8678498
## cubist 0.9804618 0.7964499 0.7363520
Among the top 20 predictors shown below, the manufacturing process variables dominate the list by 16 to 4, whereas the optimal linear and nonlinear models had ratios closer to 11:9.
varImp(cubistTuned, scale = TRUE)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess17 62.35
## ManufacturingProcess09 54.12
## BiologicalMaterial03 54.12
## BiologicalMaterial02 52.94
## ManufacturingProcess13 38.82
## BiologicalMaterial06 36.47
## ManufacturingProcess04 34.12
## ManufacturingProcess33 32.94
## ManufacturingProcess10 22.35
## ManufacturingProcess11 18.82
## ManufacturingProcess14 17.65
## ManufacturingProcess39 17.65
## ManufacturingProcess28 14.12
## ManufacturingProcess16 12.94
## BiologicalMaterial12 12.94
## ManufacturingProcess31 12.94
## ManufacturingProcess29 12.94
## ManufacturingProcess02 12.94
## ManufacturingProcess21 11.76
plot(varImp(cubistTuned), top = 20)
For the tree-based (Cubist) model, only 3 of the top 10 predictors are biological variables, compared to 4 in the linear and nonlinear models. ManufacturingProcess32 is still the most important predictor, but the remaining predictors carry noticeably less importance. BiologicalMaterial06 drops to seventh, where it had been the second most important predictor in the other models. Some predictors that were not previously in the top 10 now appear there, such as ManufacturingProcess17, 04, 33, and 10.
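A quick check of these counts straight from the Cubist importances (a small sketch reusing varImp(cubistTuned)):
imp <- varImp(cubistTuned)$importance
ranked <- rownames(imp)[order(imp$Overall, decreasing = TRUE)]
# biological vs. process split among the top 10 and top 20 predictors
table(grepl("^Biological", head(ranked, 10)))
table(grepl("^Biological", head(ranked, 20)))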
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))