To solve problems 8.1, 8.2, 8.3, and 8.7, we can use the R examples and computing techniques referenced in Chapter 8. Below are R solutions adapted for these problems:
Steps: Simulate the Friedman benchmark data with the mlbench package and fit a random forest with variable importance enabled.
Based on the variable importance output from the random forest model shown in the table below: V6–V10 have very low importance values compared to V1–V5. Their contributions are minimal or even negative, indicating that these variables were not used effectively by the model to reduce prediction error.
The model did not significantly utilize the uninformative predictors (V6–V10).
# Load required libraries
library(mlbench)
library(randomForest)
library(caret)
# Simulate data
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Fit Random Forest Model
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Variable Importance
rfImp1 <- varImp(model1, scale = FALSE)
print(rfImp1)
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Question: Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
Answer: Traditional importance (conditional = FALSE) shows diluted importance for V1 due to its correlation with duplicate1 and duplicate2. Conditional importance (conditional = TRUE) adjusts for this correlation and better reflects V1's true contribution, so it handles correlated predictors more appropriately. Apart from how the correlated copies are treated, the overall pattern matches the traditional random forest model: V1, V2, and V4 rank highest while V6–V10 remain near zero.
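The importance table below includes duplicate1 and duplicate2, two predictors highly correlated with V1, but the step that creates them is not shown above. A sketch of how they could be added, following the correlated-predictor approach used in the chapter (the seed and noise scale here are assumptions):
# Sketch (assumed step): add two predictors highly correlated with V1.
# The exact seed and noise level used originally are not shown, so these are guesses.
set.seed(200)
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * 0.1
cor(simulated$duplicate1, simulated$V1)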
# Load required libraries
library(party)
library(doParallel)
# Set up parallel backend
numCores <- parallel::detectCores() - 1 # Use all but one core
cl <- makeCluster(numCores)
registerDoParallel(cl)
# Fit a conditional inference forest model (note: party::cforest does not use the foreach backend registered above)
cforest_model <- cforest(
y ~ .,
data = simulated,
controls = cforest_unbiased(ntree = 1000)
)
# Calculate variable importance (traditional)
cforest_imp_traditional <- varimp(cforest_model, conditional = FALSE)
# Calculate variable importance (conditional)
cforest_imp_conditional <- varimp(cforest_model, conditional = TRUE)
# Combine the results into a data frame
importance_df <- data.frame(
Variable = names(cforest_imp_traditional),
Traditional_Importance = cforest_imp_traditional,
Conditional_Importance = cforest_imp_conditional
)
# Print the data frame
print(importance_df)
## Variable Traditional_Importance Conditional_Importance
## V1 V1 6.009566768 2.603605015
## V2 V2 6.207797592 4.991374161
## V3 V3 0.014141581 0.007755235
## V4 V4 7.078904806 5.676538204
## V5 V5 1.777784489 1.137300318
## V6 V6 0.002599111 -0.010147657
## V7 V7 0.021512315 -0.003595551
## V8 V8 -0.038218250 -0.015512033
## V9 V9 -0.008663233 -0.008066696
## V10 V10 -0.049479513 -0.017441744
## duplicate1 duplicate1 1.949298911 0.571089877
## duplicate2 duplicate2 1.438289739 0.371323995
# Stop the parallel backend
stopCluster(cl)
registerDoSEQ() # Return to sequential execution
Procedure: Fit boosted trees under two tuning settings and compare the predictor importance reported by the summary function for GBM.
Answer: Yes, the same pattern occurs. Boosted trees consistently identify the key predictors (V4, V1, V2) across tuning settings, with their relative rankings remaining stable. However, the absolute importance values vary slightly, with the lower tuning parameters showing higher variability across predictors.
# Load required libraries
library(caret)
library(Cubist)
library(gbm)
# Simulate the original data
# Adjust dataset size
set.seed(200)
simulated <- mlbench.friedman1(500, sd = 1) # Increase dataset size to 500
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Split data into training and testing sets
set.seed(300)
trainIndex <- createDataPartition(simulated$y, p = 0.8, list = FALSE)
trainData <- simulated[trainIndex, ]
testData <- simulated[-trainIndex, ]
# Boosted Trees Model using gbm with adjusted parameters
set.seed(400)
gbm_model <- gbm(y ~ .,
data = trainData,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.1,
bag.fraction = 0.5, # bagging fraction at the gbm default
n.minobsinnode = 5) # smaller minimum node size than the default of 10
gbm_imp_low <- summary(gbm_model)
# Adjust shrinkage and bag.fraction to 0.9
set.seed(500)
gbm_model_high <- gbm(y ~ .,
data = trainData,
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.9,
bag.fraction = 0.9,
n.minobsinnode = 5)
gbm_imp_high <- summary(gbm_model_high)
# Compile results into a data frame
boosted_importance <- data.frame(
Variable = gbm_imp_low$var,
Importance_Low = gbm_imp_low$rel.inf,
Importance_High = gbm_imp_high$rel.inf
)
# Print results
print("Boosted Trees Importance:")
## [1] "Boosted Trees Importance:"
print(boosted_importance)
## Variable Importance_Low Importance_High
## 1 V4 35.822166 35.802101
## 2 V1 21.858930 22.919984
## 3 V2 17.209438 15.881143
## 4 V3 9.883539 8.515685
## 5 V5 9.318789 7.512770
## 6 V9 1.433402 2.505251
## 7 V8 1.237092 2.044694
## 8 V7 1.118350 2.012220
## 9 V6 1.075841 1.406649
## 10 V10 1.042451 1.399503
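library(Cubist) is loaded above but not used in the code shown. For completeness, a minimal sketch of how a Cubist model could be fit on the same training split and its usage-based importance inspected (the committees value is an assumption):
# Sketch: Cubist model on the same training data; varImp (via caret) reports
# the rule/usage-based importance. committees = 50 is an arbitrary choice.
set.seed(600)
cubist_model <- cubist(x = trainData[, names(trainData) != "y"],
                       y = trainData$y,
                       committees = 50)
varImp(cubist_model)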
Objective: Simulate and illustrate tree bias when predictors have different levels of granularity. The response depends only on the continuous predictor x1; x2 and x3 are uninformative categorical predictors with 3 and 10 distinct levels, respectively.
# Load required library
library(rpart)
library(rpart.plot)
# Simulate data
set.seed(123)
n <- 500
# Predictor with high granularity (continuous)
x1 <- runif(n, 0, 100)
# Predictor with low granularity (categorical)
x2 <- sample(letters[1:3], n, replace = TRUE)
# Another categorical predictor with higher levels
x3 <- sample(letters[1:10], n, replace = TRUE)
# Response variable based on x1 and some noise
y <- 3 * x1 + rnorm(n)
# Combine into a data frame
data <- data.frame(x1, x2, x3, y)
# Fit a regression tree
tree_model <- rpart(y ~ ., data = data)
# Plot the tree
rpart.plot(tree_model)
# Check variable importance
importance <- tree_model$variable.importance
importance_df <- data.frame(Variable = names(importance), Importance = importance)
print("Variable Importance:")
## [1] "Variable Importance:"
print(importance_df)
## Variable Importance
## x1 x1 3571080.6
## x3 x3 293364.4
## x2 x2 153269.8
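The importance values above already hint at the granularity bias: although neither x2 nor x3 is related to y, x3 (10 levels) receives a higher score than x2 (3 levels). A minimal sketch that isolates this effect by fitting stumps to a pure-noise response (the seed, repetition count, and rpart.control settings are choices made here, not part of the original analysis):
# Sketch: with a response that is pure noise, the root split still favors the
# predictor with more distinct values, illustrating selection bias in CART.
set.seed(321)
firstSplit <- replicate(200, {
d <- data.frame(x1, x2, x3, noise_y = rnorm(n)) # response unrelated to predictors
fit <- rpart(noise_y ~ ., data = d,
control = rpart.control(maxdepth = 1, cp = -1, minsplit = 20))
as.character(fit$frame$var[1]) # variable used at the root split
})
table(firstSplit)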
The code loads the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling package and splits it into a predictor matrix and a response variable, Yield, which represents the percentage yield. The predictors are all columns of the data frame except the first column (Yield), and the response is assigned from the Yield column.
# Load necessary library
if (!require("AppliedPredictiveModeling")) {
install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
}
# Load the data
data("ChemicalManufacturingProcess")
predictors <- ChemicalManufacturingProcess[, -1] # Exclude the first column, which is Yield
response <- ChemicalManufacturingProcess$Yield # Assign the response variable explicitly
# Review the data
#str(predictors)
#str(response) # Response variable: percent yield
caret's preProcess is used to fill in the missing values. The code imputes missing values in the predictors using the median of each column via the caret package, creating a fully imputed dataset, imputedPredictors.
# Load necessary libraries
library(caret)
library(doParallel)
# Set up parallel processing
cores <- parallel::detectCores() - 1 # Use one less core than available
cl <- makeCluster(cores) # Create a parallel cluster
registerDoParallel(cl) # Register the parallel backend
# Create a preprocessing object to impute missing values with the median
preProcessObj <- preProcess(
predictors,
method = "medianImpute"
)
# Apply the preprocessing object to impute missing values in the dataset
# (median imputation does not rely on the foreach backend registered above)
imputedPredictors <- predict(preProcessObj, predictors)
# Stop the parallel backend
stopCluster(cl)
registerDoSEQ() # Return to sequential processing
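A quick sanity check (a minimal sketch) to confirm that the imputation left no missing values and to see which columns originally had the most:
# Sketch: verify the imputation and inspect original missingness per column.
sum(is.na(imputedPredictors)) # expected to be 0
head(sort(colSums(is.na(predictors)), decreasing = TRUE)) # original NA counts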
# Load necessary libraries
library(AppliedPredictiveModeling)
library(caret)
library(rpart)
library(randomForest)
library(gbm)
library(rpart.plot)
# Identify predictors with near-zero variance
nzv <- nearZeroVar(imputedPredictors)
# Filter out those predictors
filteredPredictors <- imputedPredictors[, -nzv]
# The response (Yield) was assigned earlier; filteredPredictors now contains only predictor columns
# Check the number of predictors remaining
num_predictors <- ncol(filteredPredictors)
print(paste("Number of predictors left after removing near-zero variance:", num_predictors))
## [1] "Number of predictors left after removing near-zero variance: 56"
# Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(response, p = 0.8, list = FALSE)
trainPredictors <- filteredPredictors[trainIndex, ]
testPredictors <- filteredPredictors[-trainIndex, ]
trainResponse <- response[trainIndex]
testResponse <- response[-trainIndex]
# Train tree-based models
# 1. Single regression tree
set.seed(456)
treeModel <- rpart(trainResponse ~ ., data = data.frame(trainPredictors, trainResponse))
# 2. Random forest
set.seed(456)
rfModel <- randomForest(
trainResponse ~ .,
data = data.frame(trainPredictors, trainResponse),
importance = TRUE # Enable variable importance calculation
)
# 3. Gradient boosting
set.seed(456)
gbmModel <- gbm(
trainResponse ~ .,
data = data.frame(trainPredictors, trainResponse),
distribution = "gaussian",
n.trees = 1000,
interaction.depth = 3,
shrinkage = 0.1
)
# Evaluate models
treePred <- predict(treeModel, testPredictors)
rfPred <- predict(rfModel, testPredictors)
gbmPred <- predict(gbmModel, testPredictors, n.trees = 1000)
treeRMSE <- RMSE(treePred, testResponse)
rfRMSE <- RMSE(rfPred, testResponse)
gbmRMSE <- RMSE(gbmPred, testResponse)
performance <- data.frame(
Model = c("Single Tree", "Random Forest", "Gradient Boosting"),
RMSE = c(treeRMSE, rfRMSE, gbmRMSE)
)
# Variable importance from the random forest model (node purity measure)
rfImportance <- data.frame(
Variable = rownames(importance(rfModel)),
Importance = importance(rfModel)[, "IncNodePurity"]
)
rfImportance <- rfImportance[order(-rfImportance$Importance), ]
Based on the RMSE values:
Gradient Boosting has the lowest RMSE (0.2566980), indicating it provides the best performance on the test set. The Single Tree model performs moderately well (RMSE = 0.2915351). The Random Forest has the highest RMSE (0.4941925), suggesting it is less effective for this dataset compared to the other models.
print("Model Performance:")
## [1] "Model Performance:"
print(performance)
## Model RMSE
## 1 Single Tree 0.2915351
## 2 Random Forest 0.4941925
## 3 Gradient Boosting 0.2566980
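For a fuller comparison than RMSE alone, caret's postResample can report RMSE, R-squared, and MAE together, using the test-set predictions computed above (a minimal sketch):
# Sketch: RMSE, R-squared, and MAE for each model via caret::postResample.
rbind(
SingleTree = postResample(treePred, testResponse),
RandomForest = postResample(rfPred, testResponse),
GradientBoosting = postResample(gbmPred, testResponse)
)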
Question: Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Answer: The most important predictors are ManufacturingProcess32, ManufacturingProcess31, and ManufacturingProcess36. These dominate the top ranks, with ManufacturingProcess32 showing the highest importance. Biological predictors such as BiologicalMaterial06 and BiologicalMaterial03 are also present but are less influential than the process predictors. Process predictors dominate the tree-based regression model's variable importance rankings, while biological variables contribute less but remain relevant. Compared to linear models, the rankings might differ because tree-based models can capture non-linear interactions; nonlinear models, like boosting, might exhibit similar patterns.
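No linear or nonlinear model from the earlier exercises is refit here, so the comparison above is qualitative. A hypothetical sketch of how a top-10 list from a linear model (PLS via caret; requires the pls package) could be produced on the same training split to make the comparison concrete:
# Hypothetical sketch: cross-validated PLS model on the same training data,
# used only to extract a top-10 importance list for comparison.
set.seed(456)
plsModel <- train(
x = trainPredictors, y = trainResponse,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10)
)
plsImp <- varImp(plsModel)$importance
head(plsImp[order(-plsImp$Overall), , drop = FALSE], 10)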
print("Top 10 Important Predictors in Random Forest:")
## [1] "Top 10 Important Predictors in Random Forest:"
print(rfImportance[1:10, ])
## Variable Importance
## Yield Yield 224.257880
## ManufacturingProcess32 ManufacturingProcess32 56.301556
## BiologicalMaterial06 BiologicalMaterial06 20.183422
## ManufacturingProcess31 ManufacturingProcess31 15.282151
## BiologicalMaterial03 BiologicalMaterial03 14.810535
## ManufacturingProcess36 ManufacturingProcess36 13.890872
## ManufacturingProcess13 ManufacturingProcess13 11.304889
## BiologicalMaterial12 BiologicalMaterial12 9.405687
## BiologicalMaterial02 BiologicalMaterial02 8.851663
## ManufacturingProcess17 ManufacturingProcess17 8.542617
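Because importance = TRUE was set when fitting the forest, the permutation-based measure (%IncMSE) is also available and is often preferred over node purity; a minimal sketch:
# Sketch: permutation-based importance (%IncMSE) from the same random forest.
rfPermImp <- importance(rfModel, type = 1) # type = 1 -> %IncMSE
head(rfPermImp[order(-rfPermImp[, 1]), , drop = FALSE], 10)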
Question: Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Yes, the tree visualization provides valuable additional insight. The splits correspond to Yield thresholds, showing how the predictors divide the data into meaningful segments, and the terminal nodes show where higher and lower Yield values concentrate. In summary, the tree adds interpretability to the relationship between predictors and yield, illustrating how variables segment the data and highlighting key areas for intervention.
# Plot the single tree with yield distribution in terminal nodes
rpart.plot(treeModel, box.palette = "RdBu", shadow.col = "gray", nn = TRUE)