Exercise 8.1 Recreate the simulated data from Exercise 7.2:
library(mlbench)
library(Cubist)
## Loading required package: lattice
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
library(AppliedPredictiveModeling)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
The random forest model fit below shows that the informative predictors (V1–V5) have much higher importance values, while the uninformative predictors (V6–V10) have scores near zero or slightly negative. In other words, the random forest did not significantly use the uninformative predictors: it concentrated on the relevant variables V1–V5 and effectively ignored V6–V10, as expected.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
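As a quick visual check (optional, not part of the original output), randomForest's built-in importance plot displays the same separation between V1–V5 and V6–V10:
# Dotchart of the permutation (%IncMSE) and node-purity importances for model1
varImpPlot(model1)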
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
## Overall
## V1 5.69119973
## V2 6.06896061
## V3 0.62970218
## V4 7.04752238
## V5 1.87238438
## V6 0.13569065
## V7 -0.01345645
## V8 -0.04370565
## V9 0.00840438
## V10 0.02894814
## duplicate1 4.28331581
Yes. After adding a predictor (duplicate1) that is highly correlated with V1 (r ≈ 0.95), the importance score for V1 dropped from about 8.73 to 5.69, and duplicate1 itself received a substantial score of about 4.28. The model split the predictive importance between V1 and its near copy rather than concentrating it on the original variable: when two highly correlated predictors are present, random forests tend to divide the importance between them, reducing the individual importance of each.
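Following the same pattern, a second near copy of V1 can be added to see the dilution continue. This is a sketch: simulated2, duplicate2, and model3 are new, illustrative names, and working on a copy keeps the later chunks that use simulated unchanged.
# Work on a copy so the data frame used by the later gbm/Cubist fits is untouched
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated2,
                       importance = TRUE,
                       ntree = 1000)
# V1's importance should now be split across V1, duplicate1, and duplicate2
varImp(model3, scale = FALSE)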
c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])
rfImp4 <- varimp(model4, conditional = TRUE)
rfImp4
## V1 V2 V3 V4 V5 V6 V7
## 6.1752967 5.0261327 0.1960935 6.2417122 1.2772163 -0.2372838 -0.1358113
## V8 V9 V10
## -0.2235061 -0.2228211 -0.3024928
Yes, the conditional importances show the same broad pattern as the traditional random forest: V1, V2, and V4 dominate, V3 and V5 receive smaller positive values, and the uninformative predictors V6–V10 sit at or below zero.
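Because the cforest fit above drops duplicate1 (columns 1:11 only), it is also worth refitting on the full data to see how the traditional and conditional measures treat the correlated pair. A sketch (model4b is a new, illustrative name, and the conditional computation can be slow):
set.seed(200)
model4b <- cforest(y ~ ., data = simulated)  # keeps duplicate1
# Side-by-side comparison of the two importance measures
cbind(traditional = varimp(model4b, conditional = FALSE),
      conditional = varimp(model4b, conditional = TRUE))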
library(caret)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Boosted Tree Model
boosted_model <- train(y ~ ., data = simulated,
method = "gbm",
verbose = FALSE)
# Variable importance
varImp(boosted_model)
## gbm variable importance
##
## Overall
## V4 100.0000
## V2 71.4235
## V1 64.9137
## V5 47.0234
## duplicate1 34.4863
## V3 27.2596
## V7 3.5976
## V6 2.0623
## V9 0.7852
## V8 0.5350
## V10 0.0000
# Cubist Model
cubist_model <- train(y ~ ., data = simulated,
method = "cubist")
# Variable importance
varImp(cubist_model)
## cubist variable importance
##
## Overall
## V2 100.00
## V1 89.52
## V4 80.65
## V3 67.74
## duplicate1 59.68
## V5 50.00
## V6 25.00
## V9 0.00
## V10 0.00
## V7 0.00
## V8 0.00
When fitting boosted trees to the simulated data, the variable importance scores showed a pattern similar to the traditional random forest model: V1's importance decreased and the highly correlated duplicate1 gained a non-trivial share (about 34 on the scaled importance output above), so boosted trees also split importance among correlated predictors. The Cubist model behaved comparably: V1 remained one of the top predictors (about 90), but duplicate1 still received a substantial score (about 60) rather than being ignored. Overall, all three models dilute the importance of V1 once a near copy of it is added, while the informative predictors V1–V5 continue to dominate the uninformative V6–V10.
Exercise 8.2 Use a simulation to show tree bias with different granularities.
set.seed(624)
library(mlbench)
set.seed(123)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Step 2: Fit simple trees (low depth)
library(rpart)
simple_tree <- rpart(y ~ ., data = simulated, control = rpart.control(maxdepth = 2))
# Step 3: Fit complex trees (high depth)
complex_tree <- rpart(y ~ ., data = simulated, control = rpart.control(maxdepth = 10))
# Step 4: Variable importance
simple_importance <- simple_tree$variable.importance
complex_importance <- complex_tree$variable.importance
# Step 5: Compare
print(simple_importance)
## V4 V2 V8 V1 V3 V6 V9
## 1447.70777 905.46545 251.76510 167.13969 141.07707 137.33708 103.09077
## V10 V5 V7
## 54.88233 30.38904 27.44116
print(complex_importance)
## V4 V2 V1 V8 V6 V5 V9 V3
## 1821.5226 1124.7965 405.5857 373.6558 316.4571 273.4891 195.1682 190.4739
## V7 V10
## 173.4785 166.9086
# Step 1: Get importance
simple_importance <- simple_tree$variable.importance
complex_importance <- complex_tree$variable.importance
# Step 2: Turn into data frames
simple_df <- data.frame(
Variable = names(simple_importance),
Importance = as.numeric(simple_importance)
)
complex_df <- data.frame(
Variable = names(complex_importance),
Importance = as.numeric(complex_importance)
)
# Step 3: Order nicely for plotting
simple_df <- simple_df[order(simple_df$Importance), ]
complex_df <- complex_df[order(complex_df$Importance), ]
# Step 4: Load ggplot2
library(ggplot2)
# Step 5: Create plots
library(gridExtra) # for side-by-side plots
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
plot_simple <- ggplot(simple_df, aes(x = Importance, y = reorder(Variable, Importance))) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Simple Tree (Low Depth)", x = "Importance", y = "Variable") +
theme_minimal()
plot_complex <- ggplot(complex_df, aes(x = Importance, y = reorder(Variable, Importance))) +
geom_bar(stat = "identity", fill = "darkorange") +
labs(title = "Complex Tree (High Depth)", x = "Importance", y = "Variable") +
theme_minimal()
# Step 6: Plot them side by side
grid.arrange(plot_simple, plot_complex, ncol = 2)
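The depth comparison above shows how importance spreads as the tree grows deeper; tree bias with respect to granularity can also be demonstrated directly by giving the tree two uninformative predictors with different numbers of distinct values. The sketch below (not from the original code; x_fine, x_coarse, and firstSplit are illustrative names) repeatedly fits a one-split rpart tree to pure noise: an unbiased procedure would choose each predictor about equally often, while CART tends to favor the continuous predictor because it offers many more candidate split points.
library(rpart)
set.seed(624)
firstSplit <- replicate(200, {
  dat <- data.frame(
    x_fine   = rnorm(100),                       # ~100 distinct values
    x_coarse = sample(0:1, 100, replace = TRUE), # 2 distinct values
    y        = rnorm(100)                        # unrelated to both predictors
  )
  fit <- rpart(y ~ ., data = dat,
               control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  as.character(fit$frame$var[1])                 # variable chosen at the root
})
# x_fine should be chosen at the root far more often than x_coarse
table(firstSplit)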
Exercise 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The model on the right has both the bagging fraction and learning rate set to 0.9, meaning it uses almost all the data for each tree and makes large updates at each boosting iteration. As a result, it quickly locks onto the strongest predictors and heavily relies on them, ignoring weaker variables. In contrast, the model on the left (bagging fraction and learning rate both at 0.1) introduces much more randomness and smaller updates, forcing the model to spread importance across more predictors. The randomness prevents early overfitting and allows weaker predictors to contribute to the model.
The model on the left (with bagging fraction and learning rate at 0.1) is likely to be more predictive on new samples. Its higher randomness and smaller updates encourage better generalization by avoiding over-reliance on a few strong predictors. In contrast, the right-hand model, while fitting the training data more aggressively, is at greater risk of overfitting because it focuses almost exclusively on the most dominant variables, making it less robust to variations in unseen data.
Increasing the interaction depth would allow the models to capture more complex relationships between predictors. As a result, the importance scores might become more evenly distributed across predictors. Weaker predictors could gain more importance if they are involved in higher-order interactions with strong predictors. Therefore, the slope of the predictor importance plots would likely become less steep — importance would not drop off as sharply after the top few variables.
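The figure referenced above is from the text, but the two extreme settings can be reproduced on the solubility data with gbm directly. This is a sketch: it assumes the solubility objects loaded by data(solubility) in AppliedPredictiveModeling, and the n.trees and interaction.depth values are illustrative rather than tuned.
library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # loads solTrainXtrans, solTrainY, ...
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
set.seed(100)
gbmLow  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.1, bag.fraction = 0.1)
set.seed(100)
gbmHigh <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
               n.trees = 100, interaction.depth = 7,
               shrinkage = 0.9, bag.fraction = 0.9)
# Relative influence: the 0.9/0.9 fit should concentrate importance on a few
# predictors, while the 0.1/0.1 fit should spread it across many more
head(summary(gbmLow,  plotit = FALSE), 10)
head(summary(gbmHigh, plotit = FALSE), 10)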
Exercise 8.7
library(ipred)
set.seed(42)
data(ChemicalManufacturingProcess)
# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)
# filter near-zero variance predictors
Chemical <- Chemical[, -nearZeroVar(Chemical)]
set.seed(624)
# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)
# train
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]
# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
# Single Tree (CART)
set.seed(42)
cartTune <- train(train_x, train_y,
method = "rpart",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test_x)
postResample(cartPred, test_y)
## RMSE Rsquared MAE
## 1.3210558 0.6408257 1.0306280
# Bagged Tree
set.seed(42)
baggedTree <- ipredbagg(train_y, train_x)
baggedPred <- predict(baggedTree, test_x)
postResample(baggedPred, test_y)
## RMSE Rsquared MAE
## 1.2397665 0.7127647 0.9135486
# Random Forest
set.seed(42)
rfModel <- randomForest(train_x, train_y,
importance = TRUE,
ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 1.2403562 0.7357469 0.8945969
# Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(42)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 1.2041025 0.7124169 0.8724615
#Cubist
set.seed(42)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 1.0433658 0.7635791 0.7995865
table_model <- rbind(cart = postResample(cartPred, test_y),
bagged = postResample(baggedPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
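To see the five models side by side, the matrix assembled above can simply be printed (values correspond to the postResample output already shown):
round(table_model, 3)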
varImp(cubistTuned, scale = TRUE)
## cubist variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 34.51
## BiologicalMaterial12 32.74
## ManufacturingProcess17 31.86
## BiologicalMaterial06 29.20
## ManufacturingProcess11 24.78
## BiologicalMaterial11 23.01
## ManufacturingProcess09 22.12
## ManufacturingProcess10 18.58
## BiologicalMaterial02 17.70
## ManufacturingProcess33 17.70
## ManufacturingProcess16 15.93
## BiologicalMaterial03 15.04
## ManufacturingProcess04 15.04
## BiologicalMaterial08 15.04
## ManufacturingProcess39 12.39
## ManufacturingProcess31 12.39
## ManufacturingProcess42 12.39
## ManufacturingProcess25 11.50
## ManufacturingProcess29 11.50
plot(varImp(cubistTuned), top = 10)
Among the tree-based regression models evaluated (a single decision tree (CART), bagged trees, random forest, boosted trees, and Cubist), the Cubist model demonstrated the best test set performance. Specifically, Cubist achieved the lowest root mean squared error (RMSE) of 1.043, the highest R-squared value of 0.764, and the lowest mean absolute error (MAE) of 0.800. These results indicate that the Cubist model was not only the most accurate in predicting yield but also explained the largest proportion of variance in the data. Therefore, the Cubist model was selected as the optimal tree-based regression model for the chemical manufacturing dataset.
The most important predictors in the Cubist model are primarily process variables, led by ManufacturingProcess32 with the highest importance of 100. Other key process variables include ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess11, and ManufacturingProcess09, although several biological predictors (BiologicalMaterial12, BiologicalMaterial06, and BiologicalMaterial11) also rank in the top ten.
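As a quick cross-check (a sketch using the random forest fit earlier in this exercise), the same ranking can be pulled from rfModel to see whether it also places process variables at the top:
# Top 10 predictors from the random forest (rfModel); compare with the
# Cubist ranking above
rfImp <- varImp(rfModel, scale = FALSE)
head(rfImp[order(-rfImp$Overall), , drop = FALSE], 10)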
library(rpart.plot)
# Build your tree
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])
# Plot the tree nicely
rpart.plot(rpartTree,
type = 2, # Split labels under the nodes
extra = 101, # Show % observations and mean yield
box.palette = "Blues", # Blue shades based on predicted yield
fallen.leaves = TRUE, # Terminal nodes neatly at the bottom
shadow.col = "gray", # Soft shadow around the boxes
cex = 0.7)
The decision tree highlights that ManufacturingProcess32 is the primary driver of yield, with a clear threshold around 159.5. Subsequent splits involve both biological and process variables, confirming their complementary importance. The distribution of yield in the terminal nodes shows distinct groups, suggesting that specific value ranges of the key predictors have a substantial impact on yield. This visualization provides additional, actionable insight into how process and biological factors interact to influence yield.