Assignment 9

Daniel DeBonis

library(AppliedPredictiveModeling)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

8.1 Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

a. Fit a random forest model to all of the predictors, then estimate the variable importance scores:

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)

Did the random forest model significantly use the uninformative predictors (V6–V10)?

print(rfImp1)
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

With scores this low, the uninformative predictors were barely used, if at all: V6 through V10 all have importance scores below 0.2, while the informative predictors score far higher. V1 and V4 were the most important predictors in this model.
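As a quick visual check, here is a sketch using randomForest's built-in importance plot on the model1 object fit above:

# Permutation importance (%IncMSE): V6-V10 cluster near zero while
# V1, V2, and V4 stand well apart.
varImpPlot(model1, type = 1)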

b. Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
print(rfImp2)
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

By adding a variable so strongly correlated with V1, the overall importance of V1 in our model decreases, since part of the role V1 played in the model is now taken over by the duplicate variable.
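A side-by-side view makes the drop concrete. This is a minimal sketch, assuming rfImp1 and rfImp2 from the chunks above are still in memory; the object name imp_compare is just illustrative.

# Join the two importance tables to compare each predictor before and after
# duplicate1 was added; V1's score falls while duplicate1 picks up the slack.
imp_compare <- dplyr::full_join(
  tibble::rownames_to_column(rfImp1, "predictor"),
  tibble::rownames_to_column(rfImp2, "predictor"),
  by = "predictor", suffix = c("_before", "_after")
)
print(imp_compare)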

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
print(rfImp3)
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

Adding yet another duplicate variable further dilutes the importance of both V1 and the first duplicate: the signal V1 carries is now spread across three nearly interchangeable predictors in this model.
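The correlation structure among the three near-copies shows why the importance gets split; a quick check on the data frame built above:

# V1, duplicate1, and duplicate2 are all highly correlated with one another
cor(simulated[, c("V1", "duplicate1", "duplicate2")])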

c. Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
c_model <- cforest(y~., data=simulated)

imp_con <- varimp(c_model, conditional = TRUE)
print(imp_con)
##          V1          V2          V3          V4          V5          V6 
##  3.87666529  2.32554116  0.02041033  7.85292146  1.00530847  0.06644453 
##          V7          V8          V9         V10 
## -0.02061033  0.01306147 -0.01333696 -0.02609467

The pattern between the two models is very similar, with V4, V1, and V2 being the most important. The difference is that the cforest model puts even more emphasis on V4 specifically. The conditional importance measure also lets us account for correlation among the predictors.
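For contrast, the traditional (unconditional) permutation importance can be pulled from the same cforest fit by flipping the conditional flag; a sketch (imp_uncon is an illustrative name). Since this run regenerated the data without the duplicate predictors, the two measures should broadly agree here.

# Traditional permutation importance from the same conditional inference forest
imp_uncon <- varimp(c_model, conditional = FALSE)
print(imp_uncon)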

d. Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), 
                      n.trees = seq(100, 1000, by = 100), 
                      shrinkage = 0.1, 
                      n.minobsinnode = 5)
model_gbm = train(y ~ ., data = simulated, 
                  tuneGrid = gbmGrid, verbose = FALSE, 
                  method = 'gbm')
gbm_imp<-varImp(model_gbm)
print(gbm_imp)
## gbm variable importance
## 
##     Overall
## V4  100.000
## V1   60.412
## V2   43.192
## V3   30.157
## V5   27.386
## V9    3.726
## V6    2.717
## V10   2.012
## V8    1.112
## V7    0.000

V6–V10 still play a minimal role in the model, if any. V4 is clearly the most important variable, with V1 and V2 behind it.
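The tuning combination caret settled on is not printed above; it can be read off the train object:

# Selected values of n.trees, interaction.depth, shrinkage, and n.minobsinnode
model_gbm$bestTune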

library(Cubist)
cubist <- cubist(simulated[,-11], simulated[, 11])
cubist_imp <- varImp(cubist)
print(cubist_imp)
##     Overall
## V1     91.0
## V2     91.0
## V3     65.5
## V4     67.5
## V5     50.0
## V8     13.5
## V10     3.0
## V6      0.0
## V7      0.0
## V9      0.0

Unlike the other models, Cubist ranks V1 and V2 highest (tied at 91.0), with V4 and V3 next. Still, the informative predictors V1–V5 dominate and V6–V10 contribute little or nothing, so the broad pattern of which variables matter most and least is consistent across all of the models tested.
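Cubist's usage-based importance reflects how often each predictor appears in rule conditions and in the linear models at the leaves; the fitted rules themselves can be inspected directly:

# Print the rules and their linear terms to see where each predictor enters
summary(cubist)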

8.2 Use a simulation to show tree bias with different granularities.

n <- 100
train <- data.frame(
  a = rnorm(n, mean = 0, sd = 1),
  b = rnorm(n, mean = 5, sd = 5),
  c = rnorm(n, mean = 100, sd = 10)
 ) |>
  mutate(
    y = (a+b+c) * rnorm(n, 1, sd = 0.2)
  )

trainIndex <- createDataPartition(train$y, p = 0.8, list = FALSE)
trainData <- train[trainIndex, ]
testData <- train[-trainIndex, ]

rpartTune <- train(
  y ~ ., 
  data = trainData,
  method = "rpart", 
  tuneLength = 10,  
  trControl = trainControl(method = "cv")  
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(rpartTune)

Here we can see that different levels of granularity in the tree (different complexity parameter values) produce different RMSE values. A complexity parameter around 0.05 appears to be the best at minimizing RMSE without raising concerns about overfitting, since more granular models (smaller cp values) produce larger cross-validated error.
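The tuning curve above is about tree complexity rather than predictor granularity, so here is a small complementary sketch of the bias itself (the names gran_sim, fine, coarse, and gran_tree are illustrative): neither predictor is related to the response, but the one with many distinct values tends to be chosen for splits and to accumulate more importance.

# Two noise predictors with very different granularity: 'fine' has ~200
# distinct values, 'coarse' has only 4; y is unrelated to both.
set.seed(101)
gran_sim <- data.frame(
  fine   = rnorm(200),
  coarse = sample(1:4, 200, replace = TRUE),
  y      = rnorm(200)
)

library(rpart)
gran_tree <- rpart(y ~ ., data = gran_sim,
                   control = rpart.control(cp = 0, minsplit = 20))
# The high-granularity predictor typically dominates the splits even though
# neither variable carries signal.
gran_tree$variable.importance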

8.3 In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a. Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The right-hand model uses 0.9 for both the bagging fraction and the learning rate, so each tree is built on 90% of the training data and each tree's contribution is added with little shrinkage; the dominant predictors are identified quickly and importance concentrates on them. The left-hand model, with its bagging fraction of 0.1, fits each tree on only 10% of the data and learns slowly, so importance ends up spread across more predictors.
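A rough way to reproduce the qualitative pattern without the solubility data is sketched below, reusing the simulated data already in memory; the parameter choices and object names (gbm_low, gbm_high) are illustrative, not the book's exact setup.

# Same data, two extreme settings: relative influence tends to spread out when
# bag.fraction and shrinkage are 0.1 and to concentrate on the top predictors
# when both are 0.9.
set.seed(300)
gbm_low  <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)
summary(gbm_low,  plotit = FALSE)   # relative influence, both parameters 0.1
summary(gbm_high, plotit = FALSE)   # relative influence, both parameters 0.9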

b. Which model do you think would be more predictive of other samples?

The two models represent two extremes. I feel the left-hand model has too small a window into the data to be useful for prediction, whereas the right-hand model is overtrained and may not generalize well to new data points. If I had to choose one, I would expect the left-hand model to be more predictive of other samples, due specifically to its lower learning rate.

c. How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing interaction depth would likely have less of an impact on the right-hand model: with 90% of the data used for each tree, there is less left to be revealed by deeper interactions. A greater impact would be seen on the left-hand model, where looking at interactions more deeply may help expose patterns that would otherwise be missed because of the smaller sample each tree sees.

8.7 Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data("ChemicalManufacturingProcess")
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(RANN)
knn <- preProcess(ChemicalManufacturingProcess,
  method = c("center","scale","knnImpute"))
knnpre <- predict(knn, ChemicalManufacturingProcess)
missmap(knnpre, col = c("yellow", "navy"))

near_zero_chem <- nearZeroVar(knnpre)
filtered_chem <- knnpre[, -near_zero_chem]
ncol(filtered_chem)
## [1] 57
set.seed(79)

index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)

# Exclude the 'Yield' column since it is the predicted variable
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem  <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]

# Only use the 'Yield' column
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem  <- filtered_chem[-index_chem, "Yield", drop = TRUE]

a. Which tree-based regression model gives the optimal resampling and test set performance?

set.seed(200)
cartTune <- train(X_train_chem, y_train_chem,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartTune
## CART 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.01512226  0.7157069  0.5228782  0.5650310
##   0.01615008  0.7187977  0.5165319  0.5661045
##   0.01752915  0.7243224  0.5112672  0.5668065
##   0.01867819  0.7155849  0.5220335  0.5618331
##   0.03190252  0.7187842  0.5077316  0.5778305
##   0.03656443  0.7351140  0.4887174  0.5981998
##   0.04929112  0.7197995  0.4959550  0.5803143
##   0.05938974  0.7186304  0.5012085  0.5768006
##   0.10224046  0.7343931  0.4787643  0.5838474
##   0.40438968  0.9059819  0.2991268  0.7188346
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01867819.
rpartPred <- predict(cartTune, newdata = X_test_chem)
postResample(pred = rpartPred, obs = y_test_chem)
##      RMSE  Rsquared       MAE 
## 0.7801803 0.4903291 0.6155187

For the single-tree model, the optimal complexity parameter is cp ≈ 0.0187. Using the model on our test data, it accounts for about 49% of the variance in that data.

set.seed(200)
library(ipred)
baggedTree <- ipredbagg(y_train_chem, X_train_chem)
baggPred <- predict(baggedTree, newdata = X_test_chem)
postResample(pred = baggPred, obs = y_test_chem)
##      RMSE  Rsquared       MAE 
## 0.6519623 0.6496543 0.5133644

Going by the R-squared, the bagged-trees model is a clear improvement over the single tree: it accounts for nearly 65% of the variance in the test data, up from about 49%.

set.seed(200)
rforest <- train(X_train_chem, y_train_chem,
                  method = "rf",
                  tuneLength = 5,
                  trControl = trainControl(method = "cv"))
rfPred <- predict(rforest, newdata = X_test_chem)
postResample(pred = rfPred, obs = y_test_chem)
##      RMSE  Rsquared       MAE 
## 0.5903720 0.7166334 0.4464794

The random forest model performs even better than the bagged trees, accounting for 71.7% of the test-set variance at its optimal mtry of 15.
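The chosen mtry is not shown in the output above; it can be confirmed from the train object:

# mtry selected by cross-validation
rforest$bestTune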

set.seed(200)
boosted <- train(X_train_chem, y_train_chem,
                   method = "gbm",
                   verbose = FALSE,
                   tuneLength = 5,
                   trControl = trainControl(method = "cv"))
boostPred <- predict(boosted, newdata = X_test_chem)
postResample(pred = boostPred, obs = y_test_chem)
##      RMSE  Rsquared       MAE 
## 0.5419035 0.7511395 0.4426527

Once again, this model captures more variance than the last. For the boosted trees, over 75% of the variance is captured by the model, making it the best predictor yet.

set.seed(200)
cubist_model <- train(  
  x = X_train_chem,
  y = y_train_chem,
  method = "cubist"
)
cubPred <- predict(cubist_model, newdata = X_test_chem)
postResample(pred = cubPred, obs = y_test_chem)
##      RMSE  Rsquared       MAE 
## 0.4707244 0.8037697 0.3424563

The pattern of improvement in our model continues. The cubist model captures over 80% of the variance.

a. Which tree-based regression model gives the optimal resampling and test set performance?

The optimal resampling and test set performance was seen with the Cubist regression model. Not only did it provide the highest R-squared (0.804), it also yielded the lowest RMSE (0.470).
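Collecting the test-set metrics in one table makes the comparison explicit. This is a sketch that assumes the prediction objects from the chunks above are still in memory; test_results is an illustrative name.

# Test-set performance, one row per model
test_results <- rbind(
  CART         = postResample(rpartPred, y_test_chem),
  Bagged       = postResample(baggPred,  y_test_chem),
  RandomForest = postResample(rfPred,    y_test_chem),
  Boosted      = postResample(boostPred, y_test_chem),
  Cubist       = postResample(cubPred,   y_test_chem)
)
round(test_results, 3)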

b. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

cubistImp <- varImp(cubist_model)
print(cubistImp)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess17  78.125
## BiologicalMaterial03    55.208
## ManufacturingProcess13  54.167
## BiologicalMaterial02    44.792
## ManufacturingProcess33  37.500
## ManufacturingProcess04  31.250
## ManufacturingProcess09  31.250
## ManufacturingProcess19  27.083
## ManufacturingProcess11  26.042
## ManufacturingProcess29  23.958
## ManufacturingProcess31  22.917
## ManufacturingProcess27  16.667
## ManufacturingProcess39  15.625
## ManufacturingProcess24  15.625
## ManufacturingProcess25  14.583
## BiologicalMaterial12    12.500
## BiologicalMaterial01    10.417
## BiologicalMaterial04    10.417
## ManufacturingProcess22   9.375

This list is dominated primarily by manufacturing processes, making up 8 of the top 10. This is consistent with what I saw in the optimal linear model and optimal non-linear model. The exact order and weights of the different processes vary across the models, but processes 32 and 17 are consistently the most important.
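A quick tally of the top 10 confirms the split; a sketch using the varImp object above (top10 is an illustrative name):

# Count biological vs. manufacturing predictors among the ten highest scores
top10 <- head(cubistImp$importance[order(-cubistImp$importance$Overall), , drop = FALSE], 10)
table(ifelse(grepl("^Biological", rownames(top10)), "Biological", "Process"))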

c. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

library(rpart)
cartTune$finalModel
## n= 144 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 144 138.983800 -0.02054704  
##    2) ManufacturingProcess32< 0.191596 82  39.340670 -0.56378470  
##      4) BiologicalMaterial12< -0.6009825 33  12.970970 -0.95039380  
##        8) BiologicalMaterial05>=0.4725039 7   1.926762 -1.65683400 *
##        9) BiologicalMaterial05< 0.4725039 26   6.610272 -0.76019830 *
##      5) BiologicalMaterial12>=-0.6009825 49  18.115480 -0.30341530  
##       10) ManufacturingProcess25>=-0.0004697389 37   7.425615 -0.51635580 *
##       11) ManufacturingProcess25< -0.0004697389 12   3.839197  0.35315120 *
##    3) ManufacturingProcess32>=0.191596 62  43.439550  0.69792860  
##      6) ManufacturingProcess17>=-1.31689 53  25.512250  0.50064950  
##       12) ManufacturingProcess09< -0.4721252 12   5.729688 -0.07171796 *
##       13) ManufacturingProcess09>=-0.4721252 41  14.700700  0.66817170 *
##      7) ManufacturingProcess17< -1.31689 9   3.717524  1.85968300 *
rpart.plot::rpart.plot(cartTune$finalModel)

The tree view of the data is helpful for visualizing the relationship between these variables and the overall yield. We can see more clearly where the biological and manufacturing predictors interact to affect the final yield. For instance, BiologicalMaterial12 was not among the top 10 important variables in the Cubist model, but here it is a major linchpin in the tree: the differences along that variable matter less on their own than in combination with the subsequent splits, which bring in additional variables.
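Since the question asks for the distribution of yield in the terminal nodes, and rpart.plot shows only the node means, here is a sketch of one way to look at those distributions, assuming the training vectors from earlier are still in memory (leaf is an illustrative name):

# finalModel$where records each training observation's terminal node;
# boxplot the (centered and scaled) yield within each leaf.
leaf <- factor(cartTune$finalModel$where)
boxplot(y_train_chem ~ leaf, xlab = "Terminal node", ylab = "Yield (scaled)")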