This document presents solutions to a series of exercises from Chapter 8 in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. Each problem builds on the concepts of data preprocessing, model evaluation, and interpretation in predictive modeling.
We will address the problems one at a time; each solution is presented with supporting code, results, and explanations.
We recreate the data from Exercise 7.2 using the mlbench.friedman1 function to simulate a training set of 200 samples with both informative and uninformative predictors.
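For reference, the Friedman (1991) benchmark simulated by mlbench.friedman1 generates the response as \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2)\). Predictors \(x_6\)–\(x_{10}\) do not appear in this equation, which is why they are uninformative.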
# Load necessary libraries
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.2
library(caret)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Loading required package: lattice
# Set seed for reproducibility
set.seed(200)
# Generate training data
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
# Rename the response column
colnames(simulated)[ncol(simulated)] <- "y"
# View the structure of the data
str(simulated)
## 'data.frame': 200 obs. of 11 variables:
## $ V1 : num 0.534 0.584 0.59 0.691 0.667 ...
## $ V2 : num 0.648 0.438 0.588 0.226 0.819 ...
## $ V3 : num 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## $ V4 : num 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## $ V5 : num 0.929 0.1638 0.8941 0.6374 0.0831 ...
## $ V6 : num 0.3618 0.4531 0.0268 0.525 0.2234 ...
## $ V7 : num 0.827 0.649 0.179 0.513 0.664 ...
## $ V8 : num 0.421 0.845 0.35 0.797 0.904 ...
## $ V9 : num 0.5911 0.9282 0.0176 0.6899 0.397 ...
## $ V10: num 0.589 0.758 0.444 0.445 0.55 ...
## $ y : num 18.5 16.1 17.8 13.8 18.4 ...
summary(simulated)
## V1 V2 V3 V4
## Min. :0.002806 Min. :0.0004409 Min. :0.003296 Min. :0.00819
## 1st Qu.:0.217222 1st Qu.:0.2795877 1st Qu.:0.284872 1st Qu.:0.23178
## Median :0.513935 Median :0.5106664 Median :0.537307 Median :0.44458
## Mean :0.481461 Mean :0.5126936 Mean :0.508571 Mean :0.46939
## 3rd Qu.:0.680814 3rd Qu.:0.7326078 3rd Qu.:0.746155 3rd Qu.:0.68454
## Max. :0.998992 Max. :0.9840194 Max. :0.999923 Max. :0.99929
## V5 V6 V7 V8
## Min. :0.001756 Min. :0.002901 Min. :0.0003388 Min. :0.004698
## 1st Qu.:0.285408 1st Qu.:0.276340 1st Qu.:0.2092393 1st Qu.:0.289413
## Median :0.534330 Median :0.497598 Median :0.4688035 Median :0.497961
## Mean :0.517948 Mean :0.496841 Mean :0.4636166 Mean :0.512865
## 3rd Qu.:0.748180 3rd Qu.:0.749615 3rd Qu.:0.6752972 3rd Qu.:0.727624
## Max. :0.991577 Max. :0.998987 Max. :0.9943478 Max. :0.999561
## V9 V10 y
## Min. :0.0176 Min. :0.003172 Min. : 3.556
## 1st Qu.:0.2304 1st Qu.:0.317826 1st Qu.:10.756
## Median :0.5289 Median :0.535922 Median :14.556
## Mean :0.5016 Mean :0.542409 Mean :14.416
## 3rd Qu.:0.7218 3rd Qu.:0.789736 3rd Qu.:17.970
## Max. :0.9910 Max. :0.999992 Max. :28.382
We fit a Random Forest model to the data and examine variable importance to determine if uninformative predictors (V6–V10) are significantly used.
# Load necessary libraries
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Fit Random Forest model
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Extract variable importance scores
rfImp1 <- varImp(model1, scale = FALSE)
# Plot OOB error as a function of the number of trees
plot(model1)
# Plot the %IncMSE and IncNodePurity variable importance measures
varImpPlot(model1)
Based on the %IncMSE and IncNodePurity plots, the random forest model effectively ignored the uninformative predictors (V6–V10): their importance scores were close to zero, highlighting the robustness of random forests in dealing with irrelevant variables.
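Note that the variable importance output in the remaining parts of this exercise includes a duplicate1 column that the code above does not create. A minimal sketch of the step that presumably produced it, following the book's instruction to add a predictor highly correlated with V1 (the noise level below follows the book's example and is an assumption here):
# Add a predictor that is highly correlated with the informative predictor V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1
# Check the strength of the correlation
cor(simulated$duplicate1, simulated$V1)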
We fit a random forest of conditional inference trees using the cforest function from the party package to calculate conditional and traditional variable importance.
suppressWarnings({
# Load necessary library
library(party)
library(mvtnorm)
# Fit cforest model
cforest_model <- cforest(y ~ ., data = simulated, controls = cforest_unbiased(ntree = 1000))
# Calculate variable importance with and without the conditional argument
varImp_conditional <- varimp(cforest_model, conditional = TRUE)
varImp_traditional <- varimp(cforest_model, conditional = FALSE)
# Compare results
print(varImp_conditional)
print(varImp_traditional)
})
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## V1 V2 V3 V4 V5
## 3.107729e+00 5.005805e+00 1.027607e-02 6.489574e+00 1.282084e+00
## V6 V7 V8 V9 V10
## -6.171396e-05 -4.284121e-03 -1.221591e-02 6.012296e-03 -4.230969e-02
## duplicate1
## 1.624059e+00
## V1 V2 V3 V4 V5 V6
## 6.627035365 6.124537847 0.031172868 7.813790932 1.925010455 -0.021243714
## V7 V8 V9 V10 duplicate1
## 0.022778062 -0.003569653 0.026782181 -0.040253690 3.079058923
The conditional inference forest largely aligns with the traditional random forest in identifying the important predictors, though the conditional importance adjusts for collinearity: the scores of V1 and its correlated copy duplicate1 are noticeably lower under the conditional measure than under the traditional one.
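For a side-by-side view of the two importance measures, the varImp_traditional and varImp_conditional vectors computed above can be combined into one table (a minimal sketch, not part of the original analysis):
# Combine traditional and conditional importance for direct comparison
importance_compare <- data.frame(
  Traditional = round(varImp_traditional, 3),
  Conditional = round(varImp_conditional[names(varImp_traditional)], 3)
)
# Sort by the traditional measure
importance_compare[order(-importance_compare$Traditional), ]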
We fit other tree-based models, such as boosted trees and Cubist, to compare variable importance and observe whether the same patterns emerge.
# Fit boosted tree model
boosted_model <- train(
y ~ .,
data = simulated,
method = "gbm",
trControl = trainControl(method = "cv"),
verbose = FALSE
)
# Summary of the model
summary(boosted_model)
## var rel.inf
## V4 V4 29.0272317
## V2 V2 21.8540345
## V1 V1 13.9012601
## duplicate1 duplicate1 12.9859875
## V5 V5 12.0259261
## V3 V3 8.2750841
## V6 V6 0.5839979
## V7 V7 0.5174679
## V8 V8 0.3984575
## V10 V10 0.2489757
## V9 V9 0.1815770
# Fit Cubist model
cubist_model <- train(
y ~ .,
data = simulated,
method = "cubist",
trControl = trainControl(method = "cv")
)
# Summary of the model
#summary(cubist_model)
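The Cubist summary is commented out above; one way to inspect which predictors the Cubist model relies on is caret's varImp method for train objects (a minimal sketch, not run in the original):
# Model-specific variable importance for the Cubist fit
cubistImp <- varImp(cubist_model)
print(cubistImp)
plot(cubistImp, top = 11)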
Both boosted trees and Cubist models confirm the importance of the informative predictors (\(V_1, V_2, V_4\)) while effectively ignoring the uninformative ones (\(V_6\)–\(V_{10}\)). This consistency demonstrates the robustness of tree-based models in identifying relevant predictors.
suppressMessages({
# Load necessary libraries
library(rpart) # For decision trees
library(dplyr) # For data manipulation
library(ggplot2) # For visualization
# Set seed for reproducibility
set.seed(123)
# Simulate continuous variables
n <- 1000
continuous <- runif(n, 0, 1)
# Simulate ordinal variable (low, medium, high)
ordinal <- cut(continuous, breaks = 3, labels = c("low", "medium", "high"))
# Simulate categorical variable with different levels
categorical_2 <- sample(c("A", "B"), n, replace = TRUE)
categorical_5 <- sample(LETTERS[1:5], n, replace = TRUE)
categorical_10 <- sample(LETTERS[1:10], n, replace = TRUE)
# Combine into a data frame
data <- data.frame(
Continuous = continuous,
Ordinal = ordinal,
Categorical_2 = categorical_2,
Categorical_5 = categorical_5,
Categorical_10 = categorical_10,
Target = ifelse(continuous + as.numeric(ordinal) * 0.5 > 1.5, 1, 0)
)
# preview data
head(data)
})
## Continuous Ordinal Categorical_2 Categorical_5 Categorical_10 Target
## 1 0.2875775 low A A I 0
## 2 0.7883051 high B A C 1
## 3 0.4089769 medium B B H 0
## 4 0.8830174 high A B A 1
## 5 0.9404673 high B A B 1
## 6 0.0455565 low A C B 0
suppressWarnings({
library(rpart.plot)
tree_model <- rpart(Target ~ ., data = data)
# Visualize the tree
print(tree_model)
# Visualize the tree
rpart.plot(tree_model)
# Extract and print variable importance
importance <- tree_model$variable.importance
print("Variable Importance Scores:")
print(importance)
})
## n= 1000
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1000 249.951 0.493
## 2) Continuous< 0.5001242 507 0.000 0.000 *
## 3) Continuous>=0.5001242 493 0.000 1.000 *
## [1] "Variable Importance Scores:"
## Continuous Ordinal Categorical_10 Categorical_5 Categorical_2
## 249.951 168.831 21.294 5.070 3.549
From the variable importance scores:
Continuous: \(249.951\) — This feature is the most important, contributing significantly to reducing deviance.
Ordinal: \(168.831\) — The ordinal variable is the second most important but much less so than the continuous feature.
Categorical Variables:
\(\text{Categorical}_{10}\): \(21.294\)
\(\text{Categorical}_{5}\): \(5.070\)
\(\text{Categorical}_{2}\): \(3.549\)
These scores show a clear pattern: as the number of levels decreases, the importance of the categorical variables drops further, indicating a selection bias toward features with more distinct values (higher granularity).
This simulation demonstrates the bias of decision trees toward features with higher granularity (e.g., continuous variables). While this can enhance predictive power, it may lead to under-utilization of less granular but potentially valuable features, such as categorical variables with fewer levels.
Figure 8.24 from the book supports this interpretation.
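One way to probe this selection bias directly (a minimal sketch, using a reduced set of predictor types and assuming a response that is unrelated to every predictor) is to simulate a pure-noise response repeatedly and count how often each predictor is chosen for the first split:
# Count which predictor rpart chooses for the root split when the response is pure noise
set.seed(456)
first_split <- replicate(200, {
  noise_data <- data.frame(
    Continuous     = runif(100),
    Categorical_2  = factor(sample(LETTERS[1:2], 100, replace = TRUE)),
    Categorical_10 = factor(sample(LETTERS[1:10], 100, replace = TRUE)),
    Response       = rnorm(100)
  )
  fit <- rpart(Response ~ ., data = noise_data,
               control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
table(first_split)
A predictor with more distinct values or levels will tend to be selected more often, even though none of the predictors is related to the response.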
We will use the ChemicalManufacturingProcess dataset from the AppliedPredictiveModeling package and perform the necessary data imputation, data splitting, and pre-processing.
suppressWarnings({
# Load necessary libraries
library(AppliedPredictiveModeling)
library(caret)
library(rpart)
library(randomForest)
library(gbm)
library(rpart.plot)
})
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Load the data
data(ChemicalManufacturingProcess)
data <- ChemicalManufacturingProcess
# Impute missing values
preProcessSteps <- preProcess(data, method = "medianImpute")
data_imputed <- predict(preProcessSteps, data)
# Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(data_imputed$Yield, p = 0.8, list = FALSE)
trainData <- data_imputed[trainIndex, ]
testData <- data_imputed[-trainIndex, ]
# Pre-process data (center and scale). Note: Yield is the first column of this data
# frame, so dropping the last column here leaves Yield inside the pre-processing;
# the response is therefore standardized along with the predictors, and the RMSE
# values reported below are on the standardized Yield scale.
preProcessModel <- preProcess(trainData[, -ncol(trainData)], method = c("center", "scale"))
trainData[, -ncol(trainData)] <- predict(preProcessModel, trainData[, -ncol(trainData)])
testData[, -ncol(testData)] <- predict(preProcessModel, testData[, -ncol(testData)])
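As a quick sanity check (not part of the original code), the number of missing values before and after imputation can be compared:
# Missing values before and after median imputation (the second count should be zero)
sum(is.na(ChemicalManufacturingProcess))
sum(is.na(data_imputed))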
We will train three different tree-based regression models:
A single decision tree (rpart)
A random forest (randomForest)
A gradient boosted machine (gbm)
# Train a single decision tree
set.seed(123)
treeModel <- rpart(Yield ~ ., data = trainData)
treePred <- predict(treeModel, newdata = testData)
treeRMSE <- RMSE(treePred, testData$Yield)
# Train a random forest model
set.seed(123)
rfModel <- randomForest(Yield ~ ., data = trainData, ntree = 500, importance = TRUE)
rfPred <- predict(rfModel, newdata = testData)
rfRMSE <- RMSE(rfPred, testData$Yield)
# Train a gradient boosted machine model
set.seed(123)
gbmModel <- gbm(Yield ~ ., data = trainData, distribution = "gaussian", n.trees = 1000, interaction.depth = 3)
gbmPred <- predict(gbmModel, newdata = testData, n.trees = 1000)
gbmRMSE <- RMSE(gbmPred, testData$Yield)
# Compare RMSEs
results <- data.frame(
Model = c("Single Tree", "Random Forest", "Gradient Boosting"),
RMSE = c(treeRMSE, rfRMSE, gbmRMSE)
)
print(results)
## Model RMSE
## 1 Single Tree 0.9724927
## 2 Random Forest 0.6754056
## 3 Gradient Boosting 0.5946643
Based on the RMSE results, the Gradient Boosting model gives the best performance, with the lowest RMSE of 0.5947. We will analyze the variable importance from the Gradient Boosting model and categorize the top predictors into biological and process-related variables.
# Extract variable importance from Gradient Boosting model
gbmImportance <- summary(gbmModel, n.trees = 1000)
# Display the top 10 important predictors
top10GBM <- gbmImportance[1:10, ]
print("Top 10 Important Predictors in Gradient Boosting:")
## [1] "Top 10 Important Predictors in Gradient Boosting:"
print(top10GBM)
## var rel.inf
## ManufacturingProcess32 ManufacturingProcess32 23.228425
## BiologicalMaterial12 BiologicalMaterial12 6.696020
## BiologicalMaterial06 BiologicalMaterial06 6.006121
## ManufacturingProcess17 ManufacturingProcess17 5.129513
## BiologicalMaterial03 BiologicalMaterial03 4.345564
## BiologicalMaterial09 BiologicalMaterial09 3.572807
## ManufacturingProcess06 ManufacturingProcess06 2.860478
## ManufacturingProcess13 ManufacturingProcess13 2.746279
## ManufacturingProcess37 ManufacturingProcess37 2.279139
## ManufacturingProcess31 ManufacturingProcess31 2.217255
# Categorize predictors into biological and process variables
biologicalVars <- grep("^BiologicalMaterial", colnames(ChemicalManufacturingProcess), value = TRUE)
processVars <- grep("^ManufacturingProcess", colnames(ChemicalManufacturingProcess), value = TRUE)
# Identify top predictors by category
top10Biological <- intersect(top10GBM$var, biologicalVars)
top10Process <- intersect(top10GBM$var, processVars)
# Print results
cat("Biological Variables in Top 10:", top10Biological, "\n")
## Biological Variables in Top 10: BiologicalMaterial12 BiologicalMaterial06 BiologicalMaterial03 BiologicalMaterial09
cat("Process Variables in Top 10:", top10Process, "\n")
## Process Variables in Top 10: ManufacturingProcess32 ManufacturingProcess17 ManufacturingProcess06 ManufacturingProcess13 ManufacturingProcess37 ManufacturingProcess31
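To quantify the split between the two groups across all predictors (not just the top 10), the relative influence can be summed by category (a minimal sketch using the gbmImportance object created above; rel.inf is normalized to sum to 100):
# Total relative influence by predictor category
bioInfluence  <- sum(gbmImportance$rel.inf[gbmImportance$var %in% biologicalVars])
procInfluence <- sum(gbmImportance$rel.inf[gbmImportance$var %in% processVars])
cat("Total biological influence:", round(bioInfluence, 1), "\n")
cat("Total process influence:", round(procInfluence, 1), "\n")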
Although Gradient Boosting performed best, the single decision tree provides interpretability. We will plot the decision tree and visualize the distribution of Yield in its terminal nodes.
# Visualize the single decision tree
rpart.plot(treeModel, main = "Optimal Single Decision Tree")
# Analyze terminal nodes using training data
trainData$TerminalNode <- treeModel$where
# Boxplot for distribution of Yield in terminal nodes
boxplot(Yield ~ TerminalNode, data = trainData,
main = "Distribution of Yield in Terminal Nodes",
xlab = "Terminal Nodes", ylab = "Yield")
The Gradient Boosting model demonstrated the best performance, with the lowest RMSE (0.5947), compared to the Random Forest (0.6754) and the Single Tree (0.9725) models. This result highlights the effectiveness of ensemble methods like Gradient Boosting for predictive accuracy.
From the Gradient Boosting model:
The top predictor by a wide margin is ManufacturingProcess32, and process variables (ManufacturingProcess17, ManufacturingProcess06, ManufacturingProcess13, ManufacturingProcess37, ManufacturingProcess31) account for six of the top 10.
Biological variables such as BiologicalMaterial12, BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial09 also appear among the top predictors, indicating a meaningful but smaller contribution.
The dominance of process variables suggests that the manufacturing steps have the stronger influence on predicted yield, with the biological material measurements contributing secondarily.
The decision tree reveals key splits based on predictors such as:
ManufacturingProcess32, which serves as the root node.
Biological variables like BiologicalMaterial06 and BiologicalMaterial03.
These splits highlight the hierarchical influence of the predictors, with manufacturing and biological factors working in tandem to explain yield variability.
The boxplot illustrates the distribution of yield across the terminal nodes:
The variability of yield within terminal nodes reflects the impact of key predictors on yield outcomes.
Nodes associated with specific splits (e.g., BiologicalMaterial05 >= 0.72) show distinct yield patterns, providing actionable insight into how the predictors relate to process outcomes.
The updated analysis confirms that:
Gradient Boosting is the optimal model for this data.
Process variables, led by ManufacturingProcess32, are the most influential predictors, with biological variables contributing moderately.
The tree structure and terminal node distributions provide additional interpretive power, showcasing specific relationships between predictors and yield.