Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
8.1. Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No, the random forest model did not significantly use the uninformative predictors. The importance scores for V6–V10 were near zero or negative, especially when compared to the informative predictors (V1–V5), indicating they were treated as irrelevant noise.
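As an optional visual check, randomForest's varImpPlot() makes the gap between the informative and uninformative predictors easy to see (this plot is a small sketch added alongside the table above):
# Plot the permutation importance (type = 1 is %IncMSE) so the near-zero
# scores of V6-V10 stand out against V1-V5
varImpPlot(model1, type = 1, main = "Permutation Importance (model1)")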
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
After fitting another random forest to these data, the importance score of V1 decreased from 8.73 to 5.69. Because V1 and duplicate1 carry nearly identical information, the trees split on them interchangeably, so the importance that previously went to V1 alone is now shared between the two variables.
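A small sketch that makes the shift explicit by merging the two importance tables by predictor name (the column labels below are my own):
# Align the importance scores before and after adding duplicate1
impCompare <- merge(rfImp1, rfImp2, by = "row.names", all = TRUE)
colnames(impCompare) <- c("Predictor", "NoDuplicate", "WithDuplicate")
impCompare[order(-impCompare$NoDuplicate), ]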
# Adding another predictor that is highly correlated
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
## Overall
## V1 4.91687329
## V2 6.52816504
## V3 0.58711552
## V4 7.04870917
## V5 2.03115561
## V6 0.14213148
## V7 0.10991985
## V8 -0.08405687
## V9 -0.01075028
## V10 0.09230576
## duplicate1 3.80068234
## duplicate2 1.87721959
Adding another predictor that is also highly correlated with V1 decreased the V1 score even further, to 4.92. With three highly correlated variables, the model distributes the importance across all three.
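To summarize the dilution in one place, the V1 row can be pulled from each of the three importance tables (a quick sketch, no additional model fitting):
# Track how V1's importance shrinks as correlated copies are added
data.frame(Model = c("no duplicates", "one duplicate", "two duplicates"),
           V1_importance = c(rfImp1["V1", "Overall"],
                             rfImp2["V1", "Overall"],
                             rfImp3["V1", "Overall"]))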
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
library(party)
# 1. Fit the cforest model
cforest_model <- cforest(y ~ ., data = simulated,
controls = cforest_unbiased(ntree = 1000))
# 2. Calculate Unconditional Importance (Standard)
imp_unconditional <- varimp(cforest_model, conditional = FALSE)
# 3. Calculate Conditional Importance
imp_conditional <- varimp(cforest_model, conditional = TRUE)
# 4. Compare the results
print("Unconditional Importance:")
## [1] "Unconditional Importance:"
print(imp_unconditional)
## V1 V2 V3 V4 V5 V6
## 4.389829273 5.895994513 0.018773326 7.148348396 1.639542658 -0.019720513
## V7 V8 V9 V10 duplicate1 duplicate2
## 0.037680863 -0.026851969 -0.005669536 -0.022708935 4.225800961 1.159922187
print("Conditional Importance:")
## [1] "Conditional Importance:"
print(imp_conditional)
## V1 V2 V3 V4 V5 V6
## 1.754849270 4.715103754 0.010612108 5.736625783 0.988062464 -0.009498226
## V7 V8 V9 V10 duplicate1 duplicate2
## -0.008040813 -0.001296254 0.008023805 0.004988622 1.484276741 0.322110747
The unconditional importance scores show the same general pattern as the traditional random forest: the informative predictors and the duplicates of V1 receive the highest scores, while V6–V10 stay near zero. The conditional importance scores differ, however. Under conditional permutation, the score for V1 is much lower than in the traditional random forest because the conditional measure adjusts for the correlated predictors (V1, duplicate1, duplicate2), which provide nearly identical information.
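Placing the two importance vectors side by side (a small sketch) makes the shrinkage of the correlated block under conditional permutation easier to read:
# Compare unconditional vs. conditional importance for each predictor
cfCompare <- data.frame(Unconditional = imp_unconditional,
                        Conditional = imp_conditional[names(imp_unconditional)])
round(cfCompare[order(-cfCompare$Unconditional), ], 3)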
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(gbm)
# Boosted Tree
gbmModel <- gbm(y ~ ., data = simulated,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.01)
print("GBM (Boosted Tree) Importance")
## [1] "GBM (Boosted Tree) Importance"
summary(gbmModel, plotit = FALSE)
## var rel.inf
## V4 V4 27.7369784
## V2 V2 21.5805458
## V1 V1 14.7816726
## duplicate1 duplicate1 11.6172400
## V5 V5 11.1475758
## V3 V3 7.7281870
## duplicate2 duplicate2 1.6665294
## V7 V7 1.0751924
## V6 V6 1.0231036
## V9 V9 0.7765301
## V8 V8 0.4760772
## V10 V10 0.3903676
In the boosted tree model, the relationship between V1 and its duplicates is similar to that in the random forest: V1 is most important, followed by duplicate1 and then duplicate2. Because boosting fits each new tree to the residuals of the previous trees, once V1 has been used to explain the underlying pattern, the correlated copies add little new information and their importance drops off.
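One way to see that the information is shared rather than lost (a quick sketch): the combined relative influence of V1 and its two copies shows how much of the model is attributable to that single underlying signal.
# Sum the relative influence of V1 and its correlated copies
gbmInf <- summary(gbmModel, plotit = FALSE)
sum(gbmInf$rel.inf[gbmInf$var %in% c("V1", "duplicate1", "duplicate2")])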
library(Cubist)
# Cubist
predictors <- simulated[, names(simulated) != "y"]
response <- simulated$y
cubistModel <- cubist(x = predictors, y = response)
cubistImp <- varImp(cubistModel)
print("Cubist Importance")
## [1] "Cubist Importance"
print(cubistImp)
## Overall
## V1 50
## V2 50
## V4 50
## V5 50
## duplicate1 50
## V3 0
## V6 0
## V7 0
## V8 0
## V9 0
## V10 0
## duplicate2 0
The Cubist model showed a different pattern than the random forest. It treated V1 and duplicate1 equally, each with a score of 50, and filtered out the weaker predictors such as V6–V10 and duplicate2. Whereas random forests and boosted trees give small nonzero scores to almost all predictors, Cubist here assigns each predictor a usage score of either 50 or 0.
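Cubist's importance is derived from how often a predictor appears in rule conditions and in the within-rule linear models; the usage table on the fitted object (assumed here to be stored in cubistModel$usage) shows those statistics directly:
# Percentage of conditions and models that use each predictor (assumed slot)
cubistModel$usage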
Use a simulation to show tree bias with different granularities.
library(randomForest)
set.seed(100)
n <- 200
# High: Continuous with many distinct values
X_high <- rnorm(n)
# Medium: Discrete with 10 levels
X_med <- sample(1:10, n, replace = TRUE)
# Low: Binary with 2 levels
X_low <- sample(0:1, n, replace = TRUE)
# Response variable
y <- rnorm(n)
sim_bias <- data.frame(y, X_high, X_med, X_low)
# Fit a Random Forest
rf_bias <- randomForest(y ~ ., data = sim_bias,
ntree = 1000,
importance = TRUE)
# Show the bias in the importance scores
print("Importance Scores")
## [1] "Importance Scores"
print(importance(rf_bias))
## %IncMSE IncNodePurity
## X_high -11.807776 35.591338
## X_med -3.648160 19.478221
## X_low 1.447747 8.381126
In this simulation, all three predictors are pure noise, so none of them should appear important. The random forest nonetheless gives the highest IncNodePurity score to the most granular predictor (X_high) and the lowest to the binary predictor (X_low). This happens because X_high offers many more candidate split points, so some split on it will look good by chance alone; X_low offers only a single possible split, so it has far less opportunity to appear useful. This indicates that tree-based models are biased toward predictors with more distinct values, even when those predictors are just random noise.
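The same bias can be shown even more directly with single trees (a hedged sketch reusing n from above): refit a one-split tree on fresh noise many times and tally which predictor wins the root split. An unbiased procedure would pick each predictor about a third of the time.
library(rpart)
set.seed(101)
firstSplit <- replicate(200, {
  d <- data.frame(y      = rnorm(n),
                  X_high = rnorm(n),
                  X_med  = sample(1:10, n, replace = TRUE),
                  X_low  = sample(0:1, n, replace = TRUE))
  # cp < 0 forces a split even when it does not improve the fit
  fit <- rpart(y ~ ., data = d, control = rpart.control(maxdepth = 1, cp = -1))
  as.character(fit$frame$var[1])   # variable chosen at the root node
})
table(firstSplit)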
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right uses a large bagging fraction and learning rate, so each boosting iteration makes a large update using nearly the full data set. The algorithm quickly commits to the predictors that reduce the error fastest and concentrates the importance on them. The model on the left uses a small learning rate and bagging fraction, so each iteration makes only a small update based on a small random sample. It does not commit to the strongest variables right away, and over the course of the boosting process it draws on more predictors, spreading the importance more evenly.
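A sketch of how one might reproduce the Fig. 8.24 comparison on the solubility data (the solTrainXtrans and solTrainY objects come from AppliedPredictiveModeling; n.trees = 100 is an arbitrary choice here, not the book's setting):
library(AppliedPredictiveModeling)
library(gbm)
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
# Left-hand settings: small learning rate and bagging fraction
gbmLow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
              n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand settings: large learning rate and bagging fraction
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# The high-shrinkage/high-bagging model should concentrate its relative
# influence in the first few predictors
head(summary(gbmLow, plotit = FALSE), 5)
head(summary(gbmHigh, plotit = FALSE), 5)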
Which model do you think would be more predictive of other samples?
The model on the left, with the 0.1 learning rate and bagging fraction, would likely be more predictive of new samples. The lower learning rate and smaller bagging fraction force the algorithm to learn slowly and draw on a broader range of predictors. In contrast, the model on the right (0.9 for both parameters) is more prone to overfitting: each step uses nearly the full data set and aggressively favors the predictors that reduce training error fastest, so it leans heavily on a few dominant predictors while ignoring subtler relationships that may matter for new data.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would flatten the slope of the variable importance plots, making the scores more evenly distributed. Deeper trees let the model exploit combinations of variables to reduce the error, which allows predictors that are weak individually to contribute through interactions. As more predictors share responsibility for reducing the error, the importance becomes less concentrated in the top few variables and is spread across more of them.
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Separating the target (Yield) and predictor variables
yield <- ChemicalManufacturingProcess$Yield
processPredictors <- ChemicalManufacturingProcess[, -1]
# Impute missing values with the median
preProc <- preProcess(processPredictors, method = "medianImpute")
processPredictors_imputed <- predict(preProc, processPredictors)
set.seed(1234)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors_imputed[trainIndex, ]
testX <- processPredictors_imputed[-trainIndex, ]
trainY <- yield[trainIndex]
testY <- yield[-trainIndex]
ctrl <- trainControl(method = "cv", number = 10)
Which tree-based regression model gives the optimal resampling and test set performance?
set.seed(123)
# Single Tree (CART)
cartFit <- train(x = trainX, y = trainY,
method = "rpart",
trControl = ctrl,
tuneLength = 10)
set.seed(123)
# Random Forest
rfFit <- train(x = trainX, y = trainY,
method = "rf",
trControl = ctrl,
importance = TRUE,
ntree = 1000)
set.seed(123)
# Boosted Trees (GBM)
gbmFit <- train(x = trainX, y = trainY,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
tuneLength = 5)
set.seed(123)
# Cubist
cubistFit <- train(x = trainX, y = trainY,
method = "cubist",
trControl = ctrl)
set.seed(123)
# Results
results <- resamples(list(CART = cartFit,
RandomForest = rfFit,
BoostedTree = gbmFit,
Cubist = cubistFit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: CART, RandomForest, BoostedTree, Cubist
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.7872247 0.9165359 1.1951854 1.1501016 1.3472167 1.460627 0
## RandomForest 0.5294602 0.7094446 0.8580053 0.8786830 1.0176632 1.268532 0
## BoostedTree 0.4562075 0.7371038 0.9515261 0.8845257 0.9628032 1.171705 0
## Cubist 0.3640587 0.6261434 0.8071463 0.7505923 0.8917524 1.060118 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 1.0493140 1.2032737 1.469353 1.455710 1.751796 1.847756 0
## RandomForest 0.7146588 0.8669454 1.115560 1.120061 1.268427 1.624538 0
## BoostedTree 0.5428109 0.9775802 1.119896 1.127509 1.348215 1.591389 0
## Cubist 0.5406720 0.7924334 1.035473 1.013169 1.207064 1.473122 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## CART 0.1892278 0.3243011 0.4342645 0.4415294 0.5579403 0.6461616 0
## RandomForest 0.5082836 0.5729517 0.7034863 0.6752660 0.7524627 0.8542659 0
## BoostedTree 0.4907246 0.5166832 0.6276451 0.6355303 0.7054676 0.9028644 0
## Cubist 0.5647881 0.5860383 0.6990217 0.7207328 0.8501100 0.9266960 0
# Test results
set.seed(123)
calc_perf <- function(model, test_x, test_y, model_name) {
preds <- predict(model, test_x)
perf <- postResample(pred = preds, obs = test_y)
return(c(Model = model_name, RMSE = perf["RMSE"], Rsquared = perf["Rsquared"]))
}
test_results <- rbind(
calc_perf(cartFit, testX, testY, "CART (Single Tree)"),
calc_perf(rfFit, testX, testY, "Random Forest"),
calc_perf(gbmFit, testX, testY, "Boosted Tree"),
calc_perf(cubistFit, testX, testY, "Cubist")
)
print("Test Results")
## [1] "Test Results"
print(test_results)
## Model RMSE.RMSE Rsquared.Rsquared
## [1,] "CART (Single Tree)" "1.29110133223877" "0.373181936246077"
## [2,] "Random Forest" "0.971339718407572" "0.670376957618169"
## [3,] "Boosted Tree" "0.959094368792033" "0.652207075268006"
## [4,] "Cubist" "1.01915040440364" "0.619183907737866"
During resampling, the Cubist model gave the best performance, with the lowest mean RMSE (1.013) and the highest mean R^2 (0.721). On the test set, however, the boosted tree model performed best, with the lowest RMSE (0.959) and the second-highest R^2 (0.652).
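An optional visual check of the resampling comparison, using caret's lattice method for resamples objects:
# Box-and-whisker comparison of cross-validated RMSE across the four models
bwplot(results, metric = "RMSE")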
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
# Extract importance from the Boosted Tree model
gbmImp <- varImp(gbmFit, scale = FALSE)
# Print the top 10 most important variables
print(head(gbmImp$importance[order(-gbmImp$importance$Overall), , drop = FALSE], 10))
## Overall
## ManufacturingProcess32 418.59791
## BiologicalMaterial12 225.25356
## ManufacturingProcess36 137.94832
## ManufacturingProcess31 106.90945
## ManufacturingProcess13 106.27378
## ManufacturingProcess06 83.39040
## ManufacturingProcess17 79.89824
## ManufacturingProcess09 73.37946
## ManufacturingProcess05 59.28833
## BiologicalMaterial09 57.09107
In the boosted tree model, ManufacturingProcess32 was the most important variable with a score of 418.6, nearly double that of the next most important predictor, BiologicalMaterial12 (225.3). The list is dominated by manufacturing process variables, which account for 7 of the top 10 predictors.
These results are consistent with the previous findings: the top 10 predictors of this boosted tree model overlap substantially with those from the optimal linear and nonlinear models. ManufacturingProcess32 is consistently the top predictor across model types, and the manufacturing process variables consistently outrank the biological variables in importance.
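A quick tally (sketch) backing up the process-versus-biological split among the top 10:
# Count how many of the top 10 predictors are process vs. biological variables
top10 <- head(rownames(gbmImp$importance)[order(-gbmImp$importance$Overall)], 10)
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))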
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)
rpart.plot(cartFit$finalModel,
type = 4,
extra = 101,
main = "Optimal Single Tree for Yield")
Yes, this view of the data provides additional knowledge because it reveals the specific splits and threshold values behind the relationships. Whereas the variable importance list only identifies the top predictors, the tree shows the decision path and the cut-off values required to reach a high-yield terminal node. The tree also exposes interactions between predictors: it shows the specific combination of process and biological conditions that leads to the terminal node with the highest yield, which is not visible in the variable importance ranking.
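The same terminal-node yields can also be read from the fitted rpart object in tabular form (a small sketch that complements the plot):
# Terminal nodes: number of training samples (n) and mean yield (yval)
leafNodes <- cartFit$finalModel$frame
subset(leafNodes, var == "<leaf>", select = c(n, yval))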