The following RMD contains CUNY SPS DATA 624 Spring 2025 answers to exercises in Chapter 8 of Kuhn and Johnson's Applied Predictive Modeling textbook (http://appliedpredictivemodeling.com/). This chapter focuses on regression trees and rule-based models. The Chapter 8 exercises answered here are 8.1, 8.2, 8.3, and 8.7.
library(tidyverse)
library(dplyr)
Prompt: Recreate the simulated data from Exercise 7.2:
# Provided code
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
# Visualize the results
print(head(simulated))
## V1 V2 V3 V4 V5 V6 V7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## V8 V9 V10 y
## 1 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.6489177 0.53116033 0.9066182 20.85817
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
# Provided code
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
# Display the variable importance results
rfImp1
## Overall
## V1 8.62743275
## V2 6.27437240
## V3 0.72305459
## V4 7.50258584
## V5 2.13575650
## V6 0.12395003
## V7 0.02927888
## V8 -0.11724317
## V9 -0.10344797
## V10 0.04312556
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
The random forest model did not significantly use the uninformative predictors. Each of the V6 through V10 predictors has an importance score near zero (all below 0.15, and two are slightly negative), so the model is not leaning on them when making predictions. In other words, the random forest successfully identified these variables as uninformative. A quick cross-check is sketched below.
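As a sanity check (a minimal sketch, assuming model1 from the chunk above is still in memory), randomForest's own importance() function reports both the permutation-based %IncMSE and the IncNodePurity measures; V6 through V10 should sit near zero on both:
# Sketch: cross-check with randomForest's built-in importance() output
round(importance(model1), 3)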
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
# Provided code
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9485201
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
# Generate the new model with 1000 ntree
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
# Extract variable importance from the fitted random forest model
rfImp2 <- varImp(model2, scale = FALSE)
# Display the variable importance results
rfImp2
## Overall
## V1 6.774034589
## V2 6.426340527
## V3 0.613805379
## V4 7.135941576
## V5 2.135242904
## V6 0.171933358
## V7 0.142238552
## V8 -0.073192083
## V9 -0.098719872
## V10 -0.009701234
## duplicate1 3.084990840
After adding a new variable that is highly correlated with V1, the importance score for V1 dropped from about 8.63 to 6.77, and the new variable (duplicate1) received an importance score of about 3.08, while the scores of the other informative predictors changed only slightly. Introducing a highly correlated predictor therefore splits the original importance between the two variables. Adding yet another predictor that is highly correlated with V1 would be expected to dilute V1's importance even further, spreading it across all of the correlated copies (see the sketch below).
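A minimal sketch of that second step (working in a copy of the data so the later chunks, which expect only duplicate1, are unaffected; exact scores will vary with the random seed):
# Sketch: add a second copy of V1 with a little noise and refit
simulated2 <- simulated
simulated2$duplicate2 <- simulated2$V1 + rnorm(200) * .1
model3 <- randomForest(y ~ ., data = simulated2, importance = TRUE, ntree = 1000)
# Expect V1's importance to drop again, shared across duplicate1 and duplicate2
varImp(model3, scale = FALSE)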
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
# Load the party library
library(party)
# Set up control parameters for the conditional forest
ctrl <- cforest_control(mtry = 2, ntree = 1000)
# Fit a conditional inference forest model
model_cforest <- cforest(y ~ ., data = simulated, controls = ctrl)
# Calculate variable importance without adjusting for correlations between predictors
imp_non_cond <- varimp(model_cforest, conditional = FALSE)
# Calculate conditional variable importance, adjusting for correlations between predictors
imp_cond <- varimp(model_cforest, conditional = TRUE)
# Display the non-conditional variable importance results
imp_non_cond
## V1 V2 V3 V4 V5 V6
## 4.05990861 3.67347532 0.09082457 3.95370860 1.46224059 0.05571918
## V7 V8 V9 V10 duplicate1
## 0.01212875 -0.03828363 -0.01370823 -0.08870209 2.12343800
# Display the conditional variable importance results
imp_cond
## V1 V2 V3 V4 V5 V6
## 2.08672381 2.67419299 0.09810369 2.90844608 0.95205879 0.02603743
## V7 V8 V9 V10 duplicate1
## -0.02289174 -0.01669622 -0.00389033 -0.03668175 1.03310323
No, the importance scores from the conditional inference forest do not show quite the same pattern as the traditional random forest model. When the conditional argument is set to TRUE, varimp adjusts for correlations between predictors, which noticeably lowers the scores for V1 and duplicate1. In the traditional random forest, these predictors receive higher scores because their correlation is not accounted for. The conditional adjustment gives a more interpretable ranking that reflects each variable's unique contribution, whereas the traditional measure can overemphasize redundant predictors.
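For a quick side-by-side look (a small sketch, assuming imp_non_cond and imp_cond from the chunks above are still in memory):
# Sketch: compare the unconditional and conditional importance measures
round(data.frame(unconditional = imp_non_cond,
                 conditional = imp_cond), 3)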
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
# Load the necessary library for boosting
library(gbm)
library(caret)
# Train a boosted tree model using the 'gbm' method with cross-validation (5-fold)
gbm_model <- train(y ~ ., data = simulated, method = "gbm",
trControl = trainControl(method = "cv", number = 5),
verbose = FALSE)
# Extract variable importance from the model
gbm_imp_standard <- varImp(gbm_model)$importance  # the conditional argument only applies to party::varimp, so it is dropped here
# View the results
gbm_imp_standard
## Overall
## V1 77.5902823
## V2 81.3301225
## V3 30.6481321
## V4 100.0000000
## V5 43.4790270
## V6 2.6777405
## V7 2.4749641
## V8 0.2497932
## V9 0.5103595
## V10 0.0000000
## duplicate1 13.2091222
The boosted tree model shows the same pattern as the previous models: importance is highest for V4, V2, and V1, while the uninformative V6 through V10 all have values below 3. The correlated duplicate1 again draws some importance away from V1.
# Load the necessary library for Cubist models
library(Cubist)
# Train a Cubist model using cross-validation (5-fold)
cubist_model <- train(y ~ ., data = simulated, method = "cubist",
trControl = trainControl(method = "cv", number = 5))
# Extract the variable importance from the trained Cubist model
cubist_imp <- varImp(cubist_model)$importance
# View the results
cubist_imp
## Overall
## V1 100.000000
## V3 54.166667
## V2 84.027778
## V4 65.277778
## V5 51.388889
## V6 14.583333
## V8 4.166667
## duplicate1 3.472222
## V7 0.000000
## V9 0.000000
## V10 0.000000
The Cubist model shows a broadly similar pattern: importance is highest for V1, V2, and V4, while V7, V9, and V10 receive no importance at all. V6 is somewhat higher here (about 15), and duplicate1 receives very little importance, suggesting Cubist relies on V1 itself rather than splitting credit with its correlated copy.
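To make the comparison across models easier to read, here is a rough sketch that rescales each importance vector to a common 0-100 range (assuming rfImp2, imp_cond, gbm_imp_standard, and cubist_imp from the chunks above are still in memory):
# Sketch: put all importance measures on a common 0-100 scale for comparison
scale100 <- function(x) 100 * (x - min(x)) / (max(x) - min(x))
vars <- rownames(rfImp2)
importance_comparison <- data.frame(
  randomForest = scale100(rfImp2[vars, "Overall"]),
  cforest_cond = scale100(imp_cond[vars]),
  gbm = scale100(gbm_imp_standard[vars, "Overall"]),
  cubist = scale100(cubist_imp[vars, "Overall"]),
  row.names = vars
)
round(importance_comparison, 1)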
Prompt: Use a simulation to show tree bias with different granularities.
# Load appropriate libraries
library(rpart)
library(ggplot2)
# Set seed for reproducibility
set.seed(64)
n <- 500
# Generate a continuous predictor
X1 <- rnorm(n)
# Categorical variable with 3 categories
X2 <- sample(letters[1:3], n, replace = TRUE)
# Categorical variable with 10 categories
X3 <- sample(letters[1:10], n, replace = TRUE)
# Target variable
y <- rnorm(n)
# Combine into a data frame
data <- data.frame(X1 = X1, X2 = as.factor(X2), X3 = as.factor(X3), y = y)
# Fit a regression tree model
tree_model <- rpart(y ~ ., data = data)
# Print variable importance (based on the rpart model)
print(tree_model$variable.importance)
## X1 X3 X2
## 18.4286410 15.6273275 0.3499268
# Plot the tree
plot(tree_model)
text(tree_model, use.n = TRUE)
# Create a bar plot to visualize variable importance
var_imp <- tree_model$variable.importance
var_imp_df <- data.frame(Variable = names(var_imp), Importance = var_imp)
ggplot(var_imp_df, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Variable Importance from Regression Tree",
x = "Variables", y = "Importance")
Tree models tend to assign higher importance to predictors with more distinct values or categories, simply because those predictors offer more candidate split points. In this simulation the response is pure noise with no real relationship to any predictor, yet the continuous X1 and the 10-level X3 receive far higher importance than the 3-level X2. This illustrates the selection bias of tree-based models toward more granular predictors, even when they have no true predictive power. A repeated version of the simulation, sketched below, makes the bias more explicit.
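A minimal sketch of that repeated simulation (not part of the original submission): fit a one-split tree to fresh noise-only data many times and count which variable is chosen for the first split. An unbiased procedure would pick each variable about equally often.
# Sketch: tally the first-split variable across many noise-only datasets
set.seed(64)
first_split <- replicate(200, {
  n <- 200
  df <- data.frame(X1 = rnorm(n),
                   X2 = factor(sample(letters[1:3], n, replace = TRUE)),
                   X3 = factor(sample(letters[1:10], n, replace = TRUE)),
                   y = rnorm(n))
  # Limit to a single split at the root (cp = 0 accepts any improvement)
  fit <- rpart(y ~ ., data = df, control = rpart.control(maxdepth = 1, cp = 0))
  as.character(fit$frame$var[1])
})
# Expect X1 and X3 to be chosen far more often than the 3-level X2
table(first_split)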
Prompt: In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
Why does the model on the right focus its importance on just the first few of the predictors, whereas the model on the left spreads importance across more predictors?
See textbook for referenced visualizations.
The model on the right focuses its importance on the first few predictors because it uses a high bagging fraction and a high learning rate. With a bagging fraction of 0.9, each tree sees nearly all of the data, and with a learning rate of 0.9 each tree's contribution is large, so the ensemble quickly commits to the strongest predictors and keeps reusing them, effectively overfitting to those variables. On the left, each tree sees only 10% of the data and contributes only a small update, so different trees end up relying on different predictors, the fit converges more slowly, and the importance is spread across a wider range of features.
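Figure 8.24 itself is in the textbook, but the effect can be reproduced as a rough sketch on the solubility data from the AppliedPredictiveModeling package (the settings below only approximate the two panels; the remaining tuning values used in Fig. 8.24 are not specified here, so results will differ in detail):
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY, among others
solData <- cbind(solTrainXtrans, Solubility = solTrainY)
# Left-hand settings: small bagging fraction and learning rate
gbmLeft <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand settings: large bagging fraction and learning rate
gbmRight <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
# Compare how concentrated the relative influence is in each fit
head(summary(gbmLeft, plotit = FALSE), 10)
head(summary(gbmRight, plotit = FALSE), 10)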
Which model do you think would be more predictive of other samples?
The model on the left (bagging fraction and learning rate of 0.1) is likely to be more predictive of other samples. Because its importance is spread across more predictors, it captures a broader set of patterns rather than leaning heavily on a handful of variables. The model on the right, with its high bagging fraction and learning rate, is more prone to overfitting the training data, which tends to hurt performance on new samples. The left-hand model's small, varied subsamples and gradual learning act as a form of regularization, although it may need more trees to reach its best performance.
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth would flatten the slope of predictor importance in both models, because deeper trees use more predictors within each tree and therefore spread the importance across more variables. The effect would likely be more visible for the model on the left, where importance is already more evenly distributed; the model on the right may continue to concentrate on the strongest predictors even with deeper trees. Deeper interactions let the model capture a broader range of influences on the outcome, but they also increase the risk of overfitting. A small follow-up sketch is included below.
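A small follow-up sketch (reusing solData and the right-hand settings from the sketch above; again only an approximation of the textbook setup):
# Sketch: refit the right-hand settings with a much deeper interaction depth
gbmDeep <- gbm(Solubility ~ ., data = solData, distribution = "gaussian",
               n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
               interaction.depth = 10)
# Deeper trees use more predictors per tree, so the relative influence
# should be spread less steeply across the top variables
head(summary(gbmDeep, plotit = FALSE), 10)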
Prompt: Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
Which tree-based regression model gives the optimal resampling and test set performance?
# Provided code from 6.3
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Below code is the same as my Homework08.Rmd submission
# Predict missing values using KNN imputation function
chem_impute_vals <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
# Fill the predicted values
chem_imputed <- predict(
chem_impute_vals,
ChemicalManufacturingProcess
)
# Check for any remaining missing values
# sum(is.na(chem_imputed))
# Like in 6.2 b, filter using nearZeroVar
near_zero_chem <- nearZeroVar(chem_imputed)
filtered_chem <- chem_imputed[, -near_zero_chem]
# Create partition of 70-30
index_chem <- createDataPartition(filtered_chem$Yield, p = .7, list = FALSE)
# Split X into training and testing (exclude 'Yield' column)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Split y into training and testing (only the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
Random Forest:
# Set seed for reproducibility
set.seed(64)
# Train a Random Forest model
rfModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
importance = TRUE)
# Predict on the model
rfPredChem <- predict(rfModelChem, newdata = X_test_chem)
#Get the test set performance values
rfPerformanceChem <- postResample(pred = rfPredChem, obs = y_test_chem)
rfPerformanceChem
## RMSE Rsquared MAE
## 0.6373700 0.5639333 0.4612384
Decision Tree:
# Set seed for reproducibility
set.seed(64)
# Train a Decision Tree model
dtModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "rpart",
trControl = trainControl(method = "cv", number = 5))
# Predict on the model
dtPredChem <- predict(dtModelChem, newdata = X_test_chem)
#Get the test set performance values
dtPerformanceChem <- postResample(pred = dtPredChem, obs = y_test_chem)
dtPerformanceChem
## RMSE Rsquared MAE
## 0.7843351 0.3886078 0.5863452
Boosted Tree:
# Set seed for reproducibility
set.seed(64)
# Train a Boosted Tree model
btModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "gbm",
trControl = trainControl(method = "cv", number = 5),
verbose = FALSE)
# Predict on the model
btPredChem <- predict(btModelChem, newdata = X_test_chem)
#Get the test set performance values
btPerformanceChem <- postResample(pred = btPredChem, obs = y_test_chem)
btPerformanceChem
## RMSE Rsquared MAE
## 0.6534278 0.5345959 0.4888034
Based on the test set results above, the Random Forest achieves the lowest RMSE (0.637), with the Boosted Tree close behind (0.653) and the Decision Tree clearly worst (0.784). Because the two ensemble methods are nearly indistinguishable here, the Boosted Tree is used as the tree-based model for the remaining questions; a quick resampling comparison is sketched below.
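A quick sketch of the resampling side of the comparison (this assumes the three caret models above were trained with matching cross-validation folds, which setting the same seed before each call should ensure):
# Sketch: collect and summarize the 5-fold cross-validation results
resamps <- resamples(list(RandomForest = rfModelChem,
                          DecisionTree = dtModelChem,
                          BoostedTree = btModelChem))
summary(resamps)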
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
# Top 20 boosted tree variables
varImp(btModelChem)
## gbm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 32.001
## ManufacturingProcess17 27.047
## BiologicalMaterial03 25.577
## BiologicalMaterial12 24.853
## ManufacturingProcess31 21.415
## ManufacturingProcess15 18.735
## BiologicalMaterial08 15.117
## ManufacturingProcess26 13.961
## ManufacturingProcess28 12.646
## ManufacturingProcess06 11.872
## BiologicalMaterial09 11.304
## ManufacturingProcess24 9.995
## BiologicalMaterial11 9.001
## ManufacturingProcess22 8.896
## BiologicalMaterial04 8.882
## ManufacturingProcess04 8.316
## ManufacturingProcess01 7.714
## ManufacturingProcess05 7.479
## ManufacturingProcess20 6.965
The Boosted Tree model is used as the optimal tree-based regression model for this question.
Optimal tree-based regression model = Boosted Tree Top Predictors:
The top predictor (ManufacturingProcess32) is a process variable, with an overall importance more than three times the next value
7/10 of the top 10 are process variables
Process variables are more important and come close to dominating the list, though several biological materials still appear in the top 10 (a quick tally is sketched after these comparisons)
Optimal nonlinear regression model = SVM Top Predictors:
Top 3 are process variables
6/10 total are process variables
Process variables are more important, but they are not dominating the list.
Optimal linear regression model = Linear Model (PLS) Top Predictors:
Top 4 are process variables
7/10 are process variables
Process variables are more important, and they are closer to dominating the list than the SVM model.
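The tally for the boosted tree's top 10 can be confirmed with a short sketch (assuming btModelChem from the chunks above is still in memory):
# Sketch: count process vs. biological variables among the boosted tree's top 10
btImp <- varImp(btModelChem)$importance
top10 <- rownames(btImp)[order(btImp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Manufacturing", top10), "Process", "Biological"))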
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
# Load library
library(rpart.plot)
# Print the tree structure, then plot the optimal single tree
print(dtModelChem$finalModel)
rpart.plot(dtModelChem$finalModel)
## n= 124 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 124 126.89400  0.0432564  
##   2) ManufacturingProcess32< 0.191596 75  39.86306 -0.4784545 *
##   3) ManufacturingProcess32>=0.191596 49  35.37186  0.8417935 *
This single decision tree has just one split, on ManufacturingProcess32: samples with values below about 0.19 (60% of the training data) fall into a terminal node with a lower mean yield (about -0.48 on the centered and scaled scale), while samples above that threshold (40%) fall into a node with a higher mean yield (about 0.84). This view confirms that this process variable is the dominant driver of yield, but because the tree is so shallow it adds little beyond what the variable importance results already showed.
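To show the distribution of yield within the terminal nodes more directly, a short sketch (assuming dtModelChem and y_train_chem from the chunks above are still in memory; finalModel$where maps each training row to its terminal node):
# Sketch: yield distribution (centered and scaled) by terminal node
node <- factor(dtModelChem$finalModel$where)
boxplot(y_train_chem ~ node,
        xlab = "Terminal node", ylab = "Yield (preprocessed)",
        main = "Yield distribution by terminal node")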