8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
Did the random forest model significantly use the uninformative predictors
The random forest model did not significantly use the uninformative predictors. The importance scores for the uninformative predictors are all 0. (V6 – V10)? (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change?
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
The importance score for V1 did not change.
What happens when you add another predictor that is also highly correlated with V1?
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
When we add another predictor that is highly correlated with V1, the importance score for the other predictors decrease.
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
cfImp1 <- varimp(model4, conditional = FALSE)
cfImp1
## V1 V2 V3 V4 V5 V6 V7
## 8.3013758 6.3849498 0.2021611 7.3074416 2.1097038 -0.1612203 0.1021704
## V8 V9 V10
## -0.1382315 -0.1127212 -0.1345254
The importances show the same pattern as the traditional random forest model.
#Cubist Model
set.seed(100)
cubist <- train(y ~ ., data = simulated[, c(1:11)], method = "cubist")
rfImp6 <- varImp(cubist$finalModel, scale = FALSE)
rfImp6
This model shows the same pattern as the traditional random forest model.
#Boosted Trees
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbm_tuning_grid <- expand.grid(
interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10
)
set.seed(592)
gbm_model <- train(
y ~ .,
data = simulated[, 1:11],
method = "gbm",
tuneGrid = gbm_tuning_grid,
verbose = FALSE
)
rfImp5 <- varImp(gbm_model$finalModel, scale = FALSE)
rfImp5
They show the same pattern as the traditional random forest model.
8.2. Use a simulation to show tree bias with different granularities.
set.seed(592)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
rpart_tree <- rpart(y ~ ., data = simulated)
rpart_importance <- rpart_tree$variable.importance # Extract variable importance
rpart_importance
## V4 V2 V1 V5 V6 V3 V9 V8
## 1612.7016 1149.6641 1107.7297 486.8192 423.1049 379.0755 243.3718 223.2010
## V10 V7
## 146.0495 124.6534
plot(as.party(rpart_tree), gp = gpar(fontsize = 7))
In this case, the tree bias is shown with different granularities. The
variable importance is shown in the plot.
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
TThe bagging fraction determines how much training data is used per iteration, and the learning rate controls how quickly the model learns. A smaller learning rate and bagging fraction lead to slower, more accurate learning, reducing overfitting. A higher learning rate and bagging fraction can cause overfitting by focusing too much on a few predictors.
The model on the left generalizes better with more iterations, reducing the weight of each predictor and improving accuracy on other samples.So I think that the model on the left would be more predictive of other samples.
Interaction depth refers to the number of splits or maximum nodes in a tree. As interaction depth increases, the importance of predictors grows, allowing smaller but important predictors to have a greater impact. This makes the slope steeper or increases it.
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
set.seed(100)
data(ChemicalManufacturingProcess)
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(624)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
set.seed(592)
rfTune <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfpred <- predict(rfTune, test_x)
postResample(rfpred, test_y)
## RMSE Rsquared MAE
## 1.2685304 0.7169606 0.8994923
#Bagged tree Model
set.seed(592)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 1.3201044 0.6712407 0.9782215
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(592)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 1.1779112 0.7254734 0.8715403
set.seed(592)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.9804618 0.7964499 0.7363520
set.seed(592)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 1.5225133 0.5105273 1.1917615
The lowest RMSE is the Cubist model
rbind(rf = postResample(rfpred, test_y),
bagged = postResample(baggedPred, test_y),
gbm = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y),
cart = postResample(cartPred, test_y))
## RMSE Rsquared MAE
## rf 1.2685304 0.7169606 0.8994923
## bagged 1.3201044 0.6712407 0.9782215
## gbm 1.1779112 0.7254734 0.8715403
## cubist 0.9804618 0.7964499 0.7363520
## cart 1.5225133 0.5105273 1.1917615
cubistImp <- varImp(cubistTuned$finalModel, scale = FALSE)
cubistImp
plot(varImp(cubistTuned), top = 10)
The manufacturing process variables dominate the list of important predictors.
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
plot(as.party(rpartTree), gp = gpar(fontsize = 7))