Exercises 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)   # varImp()

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6 through V10 are essentially zero (and in some cases negative), so the random forest made no meaningful use of the uninformative predictors compared with V1 through V5.
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
The importance of V1 dropped to 5.69, while duplicate1 received an importance of 4.28. This indicates that the importance originally attributed to V1 is now split between V1 and duplicate1. Adding a second predictor that is also highly correlated with V1 dilutes the importance further (see the sketch after the output below).
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000,
                       mtry = 2)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
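The second part of the question, adding another predictor that is also highly correlated with V1, can be checked with a short follow-up. The sketch below is not part of the original run (so exact values will differ), but the expected behaviour is that the signal is diluted even further across V1 and its two noisy copies; a working copy of the data is used so the models fitted later are unaffected.
# Hedged sketch: add a second noisy copy of V1 and refit the random forest
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1

model3 <- randomForest(y ~ ., data = simulated2,
                       importance = TRUE,
                       ntree = 1000)
varImp(model3, scale = FALSE)  # V1, duplicate1, and duplicate2 now share the importance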
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Compared with the traditional (unconditional) measure, the conditional importance for V1 decreases substantially, and the importance for duplicate1 drops as well. The conditional measure from the party package therefore shows a different pattern: it discounts predictors whose apparent importance comes from correlation with other informative variables.
library(party)

controls <- cforest_control(ntree = 1000, mtry = 2)
cf_model <- cforest(y ~ ., data = simulated, controls = controls)
cf_imp <- varimp(cf_model, conditional = FALSE)
cf_imp_conditional <- varimp(cf_model, conditional = TRUE)
print(cf_imp)
## V1 V2 V3 V4 V5 V6
## 3.166087285 3.557760232 0.083675313 3.929311188 1.127230728 -0.008267497
## V7 V8 V9 V10 duplicate1
## -0.004826123 -0.044799674 0.057038436 -0.018890770 3.418620783
print(cf_imp_conditional)
## V1 V2 V3 V4 V5 V6
## 1.428160582 2.634637732 0.053824081 2.913744526 0.634624879 0.001987701
## V7 V8 V9 V10 duplicate1
## -0.040865904 -0.011076562 0.013321626 -0.002146922 1.430874011
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
As with the random forest, Cubist and the boosted (AdaBoost) model place most of their importance on V4, V2, V5, and V1, with the uninformative predictors contributing comparatively little.
library(dplyr)
library(Cubist)

data <- simulated %>% select(-y)
fit_cubist <- cubist(x = data, y = simulated$y, committees = 4)
summary(fit_cubist)
##
## Call:
## cubist.default(x = data, y = simulated$y, committees = 4)
##
##
## Cubist [Release 2.07 GPL Edition] Sat Apr 13 13:51:40 2024
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 200 cases (12 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.936506]
##
## outcome = 0.269253 + 8.9 V4 + 7.1 V2 + 5.1 V5 + 4.8 V1 + 3.2 duplicate1
##
## Model 2:
##
## Rule 2/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 1.990785]
##
## outcome = 0.826137 + 9 V4 + 8.3 V1 + 7.3 V2 + 5.2 V5 - 3 V6
##
## Model 3:
##
## Rule 3/1: [105 cases, mean 13.381248, range 3.55596 to 23.3956, est err 2.029922]
##
## if
## V1 <= 0.7340099
## V3 <= 0.654213
## then
## outcome = 2.658355 - 12.6 V3 + 11.6 duplicate1 + 10.2 V4 + 7.8 V2
## + 2.4 V6 + 1.5 V1 + 0.5 V5
##
## Rule 3/2: [20 cases, mean 14.639552, range 8.442596 to 21.62877, est err 2.450924]
##
## if
## V1 > 0.7340099
## V2 <= 0.5403168
## then
## outcome = 2.108552 + 35 V2 + 10.4 V4 - 6 V3 + 1.3 duplicate1 + 0.8 V5
##
## Rule 3/3: [57 cases, mean 14.914219, range 4.888355 to 28.38167, est err 2.814725]
##
## if
## V1 <= 0.7340099
## V3 > 0.654213
## then
## outcome = -21.377814 + 25.2 V3 + 11.3 V4 + 11 V1 + 8.1 V2 + 7.1 V5
##
## Rule 3/4: [18 cases, mean 18.628002, range 13.07191 to 23.57269, est err 2.682001]
##
## if
## V1 > 0.7340099
## V2 > 0.5403168
## then
## outcome = 43.992161 - 34.9 V2 + 0.2 V4
##
## Model 4:
##
## Rule 4/1: [200 cases, mean 14.416183, range 3.55596 to 28.38167, est err 2.058539]
##
## outcome = 0.1879 + 9.1 V4 + 7.9 V1 + 7 V5 + 7.2 V2 - 3.1 V6
##
##
## Evaluation on training data (200 cases):
##
## Average |error| 1.730346
## Relative |error| 0.43
## Correlation coefficient 0.90
##
##
## Attribute usage:
## Conds Model
##
## 25% 95% V1
## 20% 23% V3
## 5% 100% V2
## 100% V4
## 98% V5
## 63% V6
## 41% duplicate1
##
##
## Time: 0.0 secs
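As a quick cross-check (a sketch, not part of the original output), caret's varImp() method for Cubist objects condenses the condition/model attribute-usage table above into a single score per predictor:
# Hedged sketch: summarize Cubist attribute usage as one importance score per predictor
varImp(fit_cubist)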
# adabag's boosting() handles classification only, so the continuous response is
# converted to a two-class factor (High/Low around the median) before fitting.
library(adabag)

set.seed(123)
simulated$y <- factor(ifelse(simulated$y > median(simulated$y), "High", "Low"))
model_ada <- boosting(y ~ ., data = simulated, boos = TRUE, mfinal = 500)
summary(model_ada)
## Length Class Mode
## formula 3 formula call
## trees 500 -none- list
## weights 500 -none- numeric
## votes 400 -none- numeric
## prob 400 -none- numeric
## class 200 -none- character
## importance 11 -none- numeric
## terms 3 terms call
## call 5 -none- call
importance <- model_ada$importance
importance
## duplicate1 V1 V10 V2 V3 V4 V5
## 8.621045 9.116329 6.009605 15.402413 8.459015 17.841121 11.259992
## V6 V7 V8 V9
## 6.019669 5.593715 6.154161 5.522935
barplot(importance, main="Variable Importance", col='lightblue',
ylab="Importance", las=2, cex.names=0.8)
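Because the exercise poses a regression problem, an alternative is a gradient-boosted regression model, which keeps the response continuous instead of converting it to High/Low. The sketch below (using the gbm package) is not part of the original submission, so no output is shown; its relative-influence ranking can be compared against the patterns above.
# Hedged sketch: boosted regression trees on a regenerated copy of the simulated data
library(gbm)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
sim_reg <- as.data.frame(sim$x)                      # columns V1..V10
sim_reg$y <- sim$y
sim_reg$duplicate1 <- sim_reg$V1 + rnorm(200) * .1   # correlated copy of V1

gbm_fit <- gbm(y ~ ., data = sim_reg, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
summary(gbm_fit, plotit = FALSE)  # relative influence per predictor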
Use a simulation to show tree bias with different granularities.
library(rpart)
library(rpart.plot)

set.seed(624)

# Each predictor is Uniform(0, 1) plus Gaussian noise whose standard deviation
# shrinks as the scale grows, so the columns differ in how finely they resolve
# the underlying signal.
generate_data <- function(n, scales) {
  data <- as.data.frame(matrix(nrow = n, ncol = length(scales)))
  colnames(data) <- paste0("var", scales)
  for (i in seq_along(scales)) {
    scale <- scales[i]
    data[[i]] <- runif(n, min = 0, max = 1) + rnorm(n, sd = 1 / scale)
  }
  return(data)
}

scales <- c(10, 100, 1000, 10000, 100000)
simData <- generate_data(500, scales)
simData$y <- rowSums(simData)  # every column contributes equally to y

rpartTree <- rpart(y ~ ., data = simData,
                   control = rpart.control(minsplit = 10, cp = 0.01))
rpart.plot(rpartTree, main = "Decision Tree Model", extra = 100)  # extra = 100 adds node percentages
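A complementary sketch (not in the original code) isolates granularity in the sense of the number of distinct values: with a pure-noise response, an exhaustive-search tree such as rpart tends to place its root split on the variable with more candidate split points, which is the selection bias described by Strobl et al. (2007). The variable names here are made up for illustration.
# Hedged sketch: count which predictor wins the root split over repeated simulations
set.seed(624)
first_split <- replicate(200, {
  d <- data.frame(
    fine   = runif(100),                        # roughly 100 distinct values
    coarse = sample(1:4, 100, replace = TRUE)   # only 4 distinct values
  )
  d$y <- rnorm(100)                             # response unrelated to both predictors
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  as.character(fit$frame$var[1])                # variable chosen at the root
})
table(first_split)  # 'fine' is typically selected far more often than 'coarse'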
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24
provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
With both the bagging fraction and the learning rate set to 0.9, each tree sees most of the data and contributes heavily to the fit, so the model quickly commits to the strongest predictors and concentrates importance on just a few of them. With both set to 0.1, the model draws smaller, more varied samples and takes smaller steps, so a wider range of predictors has a chance to contribute and importance is spread more evenly.
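The contrast in Figure 8.24 could be reproduced directly on the solubility data; the sketch below assumes the gbm package and the AppliedPredictiveModeling solubility objects (solTrainXtrans, solTrainY) are available, uses an arbitrary interaction depth of 7, and its output is not shown in this write-up.
# Hedged sketch: the two extreme bagging fraction / learning rate settings from Fig. 8.24
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)

set.seed(100)
gbm_low  <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 7,
                    shrinkage = 0.1, bag.fraction = 0.1, verbose = FALSE)
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 7,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)

head(summary(gbm_low,  plotit = FALSE), 10)   # importance spread over many predictors
head(summary(gbm_high, plotit = FALSE), 10)   # importance concentrated in a few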
Which model do you think would be more predictive of other samples?
Using a high bagging fraction and learning rate, each tree is fit to a large subset of the data and contributes substantially to the final model, so the ensemble leans on the strong signals it identifies early. If those signals are partly specific to the training data, this leads to overfitting and poor generalization to other samples.
For intuition, suppose one input variable A is very important for prediction and the others play minor roles. Most of the bagged trees will split on A, so even though many trees are fit, they end up with similar (correlated) shapes. Because bagging averages the trees, averaging many nearly identical trees provides little additional benefit.
In conclusion, the model using the lower bagging fraction and learning rate (the left-hand model) is more likely to provide stable predictions that generalize across a variety of samples.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Models with a larger interaction depth can capture more complex interactions among predictors, and combinations built around the already important variables can further dominate the importance. As the interaction depth increases, the slope of the importance profile would therefore become steeper from top to bottom.
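Continuing the hedged solubility sketch above (same assumptions), one way to inspect this is to refit the 0.9/0.9 model with deeper trees and compare how quickly the relative influence falls off across the ranked predictors:
# Hedged sketch: same settings as gbm_high but with a larger interaction depth
set.seed(100)
gbm_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, interaction.depth = 15,
                    shrinkage = 0.9, bag.fraction = 0.9, verbose = FALSE)
head(summary(gbm_deep, plotit = FALSE), 10)  # compare the drop-off with the depth-7 fit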
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
On the held-out test set, the LightGBM model performs best (MSE ≈ 1.41, R-squared ≈ 0.57), compared with the random forest (MSE ≈ 1.73, R-squared ≈ 0.47) and the bagged tree (MSE ≈ 1.75, R-squared ≈ 0.47), so it is taken as the optimal tree-based model here. Its boosted, leaf-wise trees with shrinkage and row/column subsampling likely allow it to capture more of the signal in this relatively small, wide data set than the bagging-based models.
library(AppliedPredictiveModeling)
library(caret)   # preProcess(), createDataPartition()

data(ChemicalManufacturingProcess)

# Impute missing values with bagged-tree imputation, as in the earlier exercises
preProc_bagImpute <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
ChemicalManufacturing_imputed <- predict(preProc_bagImpute, ChemicalManufacturingProcess)
set.seed(123)
idx <- createDataPartition(ChemicalManufacturing_imputed$Yield, p = 0.8, list = FALSE)
trainData <- ChemicalManufacturing_imputed[idx, ]
testData <- ChemicalManufacturing_imputed[-idx, ]
#trainData
Random Forest
set.seed(123)
rf_model <- randomForest(Yield ~ ., data = trainData, ntree = 500, mtry = 3)
print(rf_model)
##
## Call:
## randomForest(formula = Yield ~ ., data = trainData, ntree = 500, mtry = 3)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 1.470544
## % Var explained: 56.81
predictions <- predict(rf_model, testData)
actual <- testData$Yield
mse <- mean((predictions - actual)^2)
print(paste("Mean Squared Error on Test Set:", mse))
## [1] "Mean Squared Error on Test Set: 1.73429789566514"
sse <- sum((predictions - actual)^2)
sst <- sum((actual - mean(actual))^2)
rsquared <- 1 - sse/sst
print(paste("R-squared for the model:", rsquared))
## [1] "R-squared for the model: 0.474429287202656"
LightGBM
library(lightgbm)

train_data <- lgb.Dataset(data = as.matrix(trainData[, -which(names(trainData) == "Yield")]),
                          label = trainData$Yield)
test_data <- as.matrix(testData[, -which(names(testData) == "Yield")])
test_label <- testData$Yield
params <- list(
objective = "regression",
metric = "rmse",
learning_rate = 0.1,
num_leaves = 31,
feature_fraction = 0.9,
bagging_fraction = 0.8,
bagging_freq = 5,
verbose = -1
)
set.seed(123)
lgb_model <- lgb.train(
  params = params,
  data = train_data,
  nrounds = 100,
  valids = list(train = train_data),  # note: monitors the training data, not a held-out resample
  early_stopping_rounds = 10,
  verbose = 1
)
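Because the lgb.train() call above evaluates on the training data itself, a cross-validated estimate gives a fairer picture of resampling performance. This is a sketch using lightgbm's lgb.cv(); its results are not reflected in the printed output.
# Hedged sketch: 5-fold cross-validation with the same parameter list
set.seed(123)
cv_model <- lgb.cv(params = params, data = train_data, nrounds = 100,
                   nfold = 5, early_stopping_rounds = 10, verbose = -1)
cv_model$best_iter
cv_model$best_score  # cross-validated RMSE at the best iteration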
predictions <- predict(lgb_model, test_data, num_iteration = lgb_model$best_iter)
mse <- mean((predictions - test_label)^2)
print(paste("Mean Squared Error on Test Set:", mse))
## [1] "Mean Squared Error on Test Set: 1.40559132637084"
sse <- sum((predictions - test_label)^2)
sst <- sum((test_label - mean(test_label))^2)
rsquared <- 1 - sse/sst
print(paste("R-squared for the model:", rsquared))
## [1] "R-squared for the model: 0.574042246635394"
Bagged Tree
train_data <- trainData[, -which(names(trainData) == "Yield")]
train_label <- trainData$Yield
test_data <- testData[, -which(names(testData) == "Yield")]
test_label <- testData$Yield
set.seed(123)
# mtry = length(train_data) tries all 57 predictors at each split, i.e. a bagged tree
bagged_model <- randomForest(x = train_data, y = train_label, ntree = 500, mtry = length(train_data))
print(bagged_model)
##
## Call:
## randomForest(x = train_data, y = train_label, ntree = 500, mtry = length(train_data))
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 57
##
## Mean of squared residuals: 1.40519
## % Var explained: 58.73
predictions <- predict(bagged_model, test_data)
mse <- mean((predictions - test_label)^2)
print(paste("Mean Squared Error on Test Set:", mse))
## [1] "Mean Squared Error on Test Set: 1.74913471486399"
sse <- sum((predictions - test_label)^2)
sst <- sum((test_label - mean(test_label))^2)
rsquared <- 1 - sse/sst
print(paste("R-squared for the model:", rsquared))
## [1] "R-squared for the model: 0.469933059846631"
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list?
ManufacturingProcess32 is by far the most important predictor in the LightGBM model, and process variables dominate the importance list overall.
importance <- lgb.importance(model = lgb_model, percentage = TRUE)
lgb.plot.importance(importance, top_n = 20, measure = "Gain")
print(importance)
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: ManufacturingProcess32 0.3802009001 0.091566088 0.070175439
## 2: ManufacturingProcess17 0.0781227838 0.041124032 0.038011696
## 3: BiologicalMaterial12 0.0470911711 0.047287135 0.043859649
## 4: BiologicalMaterial06 0.0453942540 0.013280018 0.017543860
## 5: ManufacturingProcess31 0.0447615703 0.023258373 0.017543860
## 6: ManufacturingProcess06 0.0281527589 0.031292417 0.026315789
## 7: ManufacturingProcess01 0.0278037611 0.042811548 0.038011696
## 8: BiologicalMaterial03 0.0266507294 0.033236729 0.035087719
## 9: ManufacturingProcess27 0.0235844623 0.012583000 0.017543860
## 10: ManufacturingProcess25 0.0199568599 0.022781467 0.026315789
## 11: ManufacturingProcess28 0.0166488568 0.031989435 0.032163743
## 12: ManufacturingProcess15 0.0153004301 0.024872519 0.023391813
## 13: ManufacturingProcess04 0.0141628651 0.035584578 0.035087719
## 14: ManufacturingProcess20 0.0139339323 0.012142779 0.014619883
## 15: ManufacturingProcess09 0.0128348852 0.016801790 0.014619883
## 16: ManufacturingProcess02 0.0119347437 0.017095271 0.020467836
## 17: ManufacturingProcess19 0.0113225605 0.026963572 0.029239766
## 18: ManufacturingProcess05 0.0112312760 0.031365787 0.032163743
## 19: ManufacturingProcess45 0.0102600214 0.039913423 0.032163743
## 20: ManufacturingProcess11 0.0101392960 0.014417257 0.011695906
## 21: ManufacturingProcess24 0.0098606974 0.010675373 0.014619883
## 22: ManufacturingProcess37 0.0097072425 0.030301919 0.023391813
## 23: ManufacturingProcess39 0.0097043801 0.015554496 0.017543860
## 24: BiologicalMaterial04 0.0094438411 0.024212187 0.026315789
## 25: BiologicalMaterial08 0.0090883457 0.013280018 0.017543860
## 26: ManufacturingProcess16 0.0088773898 0.030595400 0.029239766
## 27: BiologicalMaterial05 0.0074043170 0.027146997 0.035087719
## 28: BiologicalMaterial02 0.0073869090 0.020360248 0.023391813
## 29: ManufacturingProcess03 0.0071893683 0.015224330 0.014619883
## 30: BiologicalMaterial09 0.0068390192 0.015921347 0.017543860
## 31: ManufacturingProcess36 0.0067607849 0.008254155 0.008771930
## 32: ManufacturingProcess13 0.0065738031 0.018636047 0.020467836
## 33: ManufacturingProcess33 0.0061878148 0.018049085 0.020467836
## 34: BiologicalMaterial01 0.0054455918 0.020286878 0.017543860
## 35: ManufacturingProcess38 0.0047845474 0.015334385 0.017543860
## 36: ManufacturingProcess07 0.0047496648 0.009758245 0.011695906
## 37: BiologicalMaterial10 0.0043119554 0.012436260 0.011695906
## 38: ManufacturingProcess35 0.0037095695 0.016324884 0.014619883
## 39: ManufacturingProcess18 0.0035213474 0.011409076 0.014619883
## 40: ManufacturingProcess14 0.0034251958 0.006493268 0.005847953
## 41: ManufacturingProcess43 0.0032871547 0.004255475 0.005847953
## 42: ManufacturingProcess10 0.0019199629 0.008914487 0.008771930
## 43: ManufacturingProcess22 0.0017908023 0.007887303 0.008771930
## 44: ManufacturingProcess21 0.0016132463 0.001467405 0.002923977
## 45: ManufacturingProcess23 0.0015244507 0.006163102 0.008771930
## 46: ManufacturingProcess29 0.0014334390 0.004952493 0.005847953
## 47: ManufacturingProcess26 0.0011403304 0.005612825 0.005847953
## 48: ManufacturingProcess44 0.0008770715 0.004255475 0.005847953
## 49: ManufacturingProcess30 0.0007679439 0.002421219 0.002923977
## 50: ManufacturingProcess42 0.0007308898 0.001980997 0.002923977
## 51: BiologicalMaterial11 0.0004548051 0.001467405 0.002923977
## Feature Gain Cover Frequency
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
The tree-based, nonlinear, and linear models all rank ManufacturingProcess32 as the most important predictor, and process variables dominate the top 10 in each case, although the rankings below the top spot differ slightly between models.
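For reference, the top 10 LightGBM predictors by gain can be pulled directly from the importance table above (a quick sketch; output not shown):
# Hedged sketch: lgb.importance() returns features sorted by Gain,
# so the first ten rows are the top 10 predictors
head(importance$Feature, 10)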
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
This visualization can offer useful insight into the factors driving yield. By examining the splits and the yield distributions in the terminal nodes, one can see which predictors separate higher-yield from lower-yield batches. For example, if a specific biological material frequently appears in splits leading to higher-yield nodes, one could infer that this material positively influences yield.
library(partykit)

# Fit a single regression tree on the training split and plot its party
# representation, which displays the yield distribution in each terminal node.
rpartTree <- rpart(Yield ~ ., data = ChemicalManufacturing_imputed[idx, ])
plot(as.party(rpartTree), ip_args = list(abbreviate = 4), gp = gpar(fontsize = 7))