Exercise 8.1

Recreate the simulated data:

library(mlbench)
library(Cubist)
## Loading required package: lattice
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
library(AppliedPredictiveModeling)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:

The random forest assigned much higher importance to the informative predictors V1–V5 (although V3 and V5 scored noticeably lower than V1, V2, and V4), while the uninformative predictors V6–V10 had importance scores near zero or slightly negative. In other words, the model essentially ignored V6–V10 when building its trees and concentrated on the relevant predictors, as expected.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788
b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581

Yes. After adding duplicate1, which is highly correlated with V1 (r ≈ 0.95), the importance score for V1 dropped from about 8.73 to 5.69, and duplicate1 itself received a substantial score of about 4.28. Rather than concentrating all of the predictive importance on the original variable, the random forest split it between V1 and its near-duplicate. In general, when two highly correlated predictors are present, random forests tend to divide the importance between them, reducing the apparent importance of each.

c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

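# conditional inference forest on the ten original predictors only (duplicate1 excluded here)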
model4 <- cforest(y ~ ., data = simulated[, c(1:11)])

rfImp4 <- varimp(model4, conditional = TRUE)

rfImp4
##         V1         V2         V3         V4         V5         V6         V7 
##  6.1752967  5.0261327  0.1960935  6.2417122  1.2772163 -0.2372838 -0.1358113 
##         V8         V9        V10 
## -0.2235061 -0.2228211 -0.3024928

The conditional importance values show the same overall pattern as the traditional random forest: V1, V2, and V4 receive the largest scores, V5 a moderate one, and the uninformative predictors V6–V10 are near zero or negative. As before, V3 contributes relatively little.
d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
library(caret)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Boosted Tree Model
boosted_model <- train(y ~ ., data = simulated,
                       method = "gbm",
                       verbose = FALSE)

# Variable importance
varImp(boosted_model)
## gbm variable importance
## 
##             Overall
## V4         100.0000
## V2          71.4235
## V1          64.9137
## V5          47.0234
## duplicate1  34.4863
## V3          27.2596
## V7           3.5976
## V6           2.0623
## V9           0.7852
## V8           0.5350
## V10          0.0000
# Cubist Model
cubist_model <- train(y ~ ., data = simulated,
                      method = "cubist")

# Variable importance
varImp(cubist_model)
## cubist variable importance
## 
##            Overall
## V2          100.00
## V1           89.52
## V4           80.65
## V3           67.74
## duplicate1   59.68
## V5           50.00
## V6           25.00
## V9            0.00
## V10           0.00
## V7            0.00
## V8            0.00

When boosted trees were fit to the simulated data, the importance pattern resembled the random forest results: the informative predictors dominated, V1's importance was diluted, and duplicate1 received a non-trivial score, showing that boosting also divides importance among correlated predictors. The Cubist model behaved similarly in this run: V1 kept a high ranking (about 90 on the scaled measure), but duplicate1 was also ranked fairly high (about 60), so Cubist did not ignore the redundant variable either, although it favored the original predictor more strongly. In both models the remaining uninformative predictors (V7–V10) received little or no importance, so the overall informative-versus-uninformative pattern was preserved.

Exercise 8.2

Use a simulation to show tree bias with different granularities.

set.seed(624)
 
library(mlbench)
set.seed(123)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

# Step 2: Fit simple trees (low depth)
library(rpart)
simple_tree <- rpart(y ~ ., data = simulated, control = rpart.control(maxdepth = 2))

# Step 3: Fit complex trees (high depth)
complex_tree <- rpart(y ~ ., data = simulated, control = rpart.control(maxdepth = 10))

# Step 4: Variable importance
simple_importance <- simple_tree$variable.importance
complex_importance <- complex_tree$variable.importance

# Step 5: Compare
print(simple_importance)
##         V4         V2         V8         V1         V3         V6         V9 
## 1447.70777  905.46545  251.76510  167.13969  141.07707  137.33708  103.09077 
##        V10         V5         V7 
##   54.88233   30.38904   27.44116
print(complex_importance)
##        V4        V2        V1        V8        V6        V5        V9        V3 
## 1821.5226 1124.7965  405.5857  373.6558  316.4571  273.4891  195.1682  190.4739 
##        V7       V10 
##  173.4785  166.9086
# Step 1: Turn the importance values into data frames for plotting
simple_df <- data.frame(
  Variable = names(simple_importance),
  Importance = as.numeric(simple_importance)
)

complex_df <- data.frame(
  Variable = names(complex_importance),
  Importance = as.numeric(complex_importance)
)

# Step 2: Order by importance for plotting
simple_df <- simple_df[order(simple_df$Importance), ]
complex_df <- complex_df[order(complex_df$Importance), ]

# Step 3: Load ggplot2
library(ggplot2)

# Step 4: Create the plots
library(gridExtra) # for side-by-side plots
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
plot_simple <- ggplot(simple_df, aes(x = Importance, y = reorder(Variable, Importance))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Simple Tree (Low Depth)", x = "Importance", y = "Variable") +
  theme_minimal()

plot_complex <- ggplot(complex_df, aes(x = Importance, y = reorder(Variable, Importance))) +
  geom_bar(stat = "identity", fill = "darkorange") +
  labs(title = "Complex Tree (High Depth)", x = "Importance", y = "Variable") +
  theme_minimal()

# Step 5: Arrange the plots side by side
grid.arrange(plot_simple, plot_complex, ncol = 2)
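
The two trees above differ only in their maximum depth, so the comparison mainly shows that allowing more splits spreads importance across more predictors. To illustrate the granularity bias itself, a tree can be given two predictors that are both unrelated to the response but differ in how many distinct values they take; CART should then choose the fine-grained predictor for its first split far more often than chance. The sketch below is only illustrative, and the names (x_fine, x_coarse, first_split) and simulation settings are my own:

set.seed(624)
n_sims <- 200
first_split <- character(n_sims)

for (i in seq_len(n_sims)) {
  dat <- data.frame(y        = rnorm(100),               # pure-noise response
                    x_fine   = rnorm(100),               # ~100 distinct values
                    x_coarse = sample(0:1, 100, TRUE))   # only 2 distinct values
  # force exactly one split (a stump) and record which predictor was used
  stump <- rpart(y ~ ., data = dat,
                 control = rpart.control(maxdepth = 1, cp = 0, minsplit = 20))
  first_split[i] <- as.character(stump$frame$var[1])
}

table(first_split)   # expect x_fine to win in the large majority of runs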

Exercise 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The model on the right has both the bagging fraction and learning rate set to 0.9, meaning it uses almost all the data for each tree and makes large updates at each boosting iteration. As a result, it quickly locks onto the strongest predictors and heavily relies on them, ignoring weaker variables. In contrast, the model on the left (bagging fraction and learning rate both at 0.1) introduces much more randomness and smaller updates, forcing the model to spread importance across more predictors. The randomness prevents early overfitting and allows weaker predictors to contribute to the model.
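
A rough way to see this directly is to refit a gbm model on the solubility data with the two extreme settings and compare the relative influence rankings. The sketch below is only illustrative: solTrain, gbm_low, and gbm_high are names of my own, and n.trees = 100 with interaction.depth = 7 are arbitrary choices, not the settings used for Fig. 8.24.

library(AppliedPredictiveModeling)   # provides data(solubility); already loaded above
library(gbm)

data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)

# both tuning values at 0.1 (left-hand plot) vs. both at 0.9 (right-hand plot)
gbm_low  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 7,
                shrinkage = 0.9, bag.fraction = 0.9)

# relative influence: the 0.9/0.9 fit is expected to concentrate on far fewer predictors
head(summary(gbm_low,  plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)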

b) Which model do you think would be more predictive of other samples?

The model on the left (with bagging fraction and learning rate at 0.1) is likely to be more predictive on new samples. Its higher randomness and smaller updates encourage better generalization by avoiding over-reliance on a few strong predictors. In contrast, the right-hand model, while fitting the training data more aggressively, is at greater risk of overfitting because it focuses almost exclusively on the most dominant variables, making it less robust to variations in unseen data.
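
Reusing gbm_low and gbm_high from the sketch above, this expectation can be checked on the solubility test set; solTest and rmse below are again names of my own, and the prediction must specify the number of trees.

solTest <- cbind(solTestXtrans, Solubility = solTestY)

# simple test-set RMSE helper
rmse <- function(fit) sqrt(mean((predict(fit, solTest, n.trees = 100) - solTestY)^2))

rmse(gbm_low)    # expected to be lower (better generalization)
rmse(gbm_high)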

c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Increasing the interaction depth would allow the models to capture more complex relationships between predictors. As a result, the importance scores might become more evenly distributed across predictors. Weaker predictors could gain more importance if they are involved in higher-order interactions with strong predictors. Therefore, the slope of the predictor importance plots would likely become less steep — importance would not drop off as sharply after the top few variables.
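
One way to quantify the "slope" is to count how many predictors are needed to reach 90% of the total relative influence: a flatter importance profile needs more predictors. The sketch below reuses solTrain from above; n_for_90, gbm_shallow, and gbm_deep are illustrative names of my own.

# count how many predictors account for 90% of the relative influence
n_for_90 <- function(fit) {
  ri <- summary(fit, plotit = FALSE)$rel.inf
  sum(cumsum(sort(ri, decreasing = TRUE)) < 90) + 1
}

gbm_shallow <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 1, shrinkage = 0.1)
gbm_deep    <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 10, shrinkage = 0.1)

# deeper trees involve more predictors, so the count should grow with depth
n_for_90(gbm_shallow)
n_for_90(gbm_deep)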

Exercise 8.7

library(ipred)
set.seed(42)

data(ChemicalManufacturingProcess)

# imputation
miss <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(miss, ChemicalManufacturingProcess)

# filtering low frequencies
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(624)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]




# Single Tree (CART)
set.seed(42)

cartTune <- train(train_x, train_y,
                  method = "rpart",
                  tuneLength = 10,
                  trControl = trainControl(method = "cv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cartPred <- predict(cartTune, test_x)

postResample(cartPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.3210558 0.6408257 1.0306280
# Bagged Tree
set.seed(42)

baggedTree <- ipredbagg(train_y, train_x)

baggedPred <- predict(baggedTree, test_x)

postResample(baggedPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2397665 0.7127647 0.9135486
# Random Forest
set.seed(42)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2403562 0.7357469 0.8945969
# Boosted Trees
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(42)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.2041025 0.7124169 0.8724615
#Cubist 
set.seed(42)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##      RMSE  Rsquared       MAE 
## 1.0433658 0.7635791 0.7995865
table_model <- rbind(cart = postResample(cartPred, test_y),
                     bagged = postResample(baggedPred, test_y),
                     randomForest = postResample(rfPred, test_y),
                     boosted = postResample(gbmPred, test_y),
                     cubist = postResample(cubistPred, test_y))
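
# display the side-by-side comparison of test-set metrics assembled above
table_model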

varImp(cubistTuned, scale = TRUE)
## cubist variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   34.51
## BiologicalMaterial12     32.74
## ManufacturingProcess17   31.86
## BiologicalMaterial06     29.20
## ManufacturingProcess11   24.78
## BiologicalMaterial11     23.01
## ManufacturingProcess09   22.12
## ManufacturingProcess10   18.58
## BiologicalMaterial02     17.70
## ManufacturingProcess33   17.70
## ManufacturingProcess16   15.93
## BiologicalMaterial03     15.04
## ManufacturingProcess04   15.04
## BiologicalMaterial08     15.04
## ManufacturingProcess39   12.39
## ManufacturingProcess31   12.39
## ManufacturingProcess42   12.39
## ManufacturingProcess25   11.50
## ManufacturingProcess29   11.50
plot(varImp(cubistTuned), top = 10) 

a) Which tree-based regression model gives the optimal resampling and test set performance?

Among the tree-based regression models evaluated (a single CART tree, bagged trees, a random forest, boosted trees, and Cubist), the Cubist model gave the best test set performance: the lowest RMSE (1.043), the highest R-squared (0.764), and the lowest MAE (0.800). It was therefore selected as the optimal tree-based regression model for the chemical manufacturing data.

b) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

The most important predictors in the Cubist model are primarily process variables, led by ManufacturingProcess32 with a scaled importance of 100; other key process variables include ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess11, and ManufacturingProcess09. Of the ten highest-ranked predictors, six are manufacturing process variables and four are biological materials (BiologicalMaterial12, 06, 11, and 02), so the process variables dominate the list.
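
The question also asks how the top 10 compares with the top 10 from the optimal linear and nonlinear models, which were fit in earlier exercises and are not part of this document. If those fits were still available, the comparison could be sketched as below; plsTune and svmTune are hypothetical object names standing in for those models.

# extract the ten highest-ranked predictors from a caret model
top10 <- function(fit) {
  imp <- varImp(fit)$importance
  head(rownames(imp)[order(imp$Overall, decreasing = TRUE)], 10)
}

top10(cubistTuned)
# intersect(top10(cubistTuned), top10(plsTune))   # overlap with the linear model (hypothetical object)
# intersect(top10(cubistTuned), top10(svmTune))   # overlap with the nonlinear model (hypothetical object)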

c) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
library(rpart.plot)

# Build your tree
rpartTree <- rpart(Yield ~ ., data = Chemical[index, ])

# Plot the tree nicely
rpart.plot(rpartTree,
           type = 2,             # Split labels under the nodes
           extra = 101,          # Show % observations and mean yield
           box.palette = "Blues", # Blue shades based on predicted yield
           fallen.leaves = TRUE, # Terminal nodes neatly at the bottom
           shadow.col = "gray",  # Soft shadow around the boxes
           cex = 0.7)  

The decision tree highlights ManufacturingProcess32 as the primary driver of yield, with a split threshold around 159.5. Subsequent splits involve both biological and process variables, confirming their complementary roles. The distribution of yield in the terminal nodes separates into distinct groups, suggesting that specific value ranges of the key predictors correspond to meaningfully different yields. This view of the data therefore adds actionable insight into how the process and biological factors combine to influence yield.
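
To look at the actual spread of yield within each terminal node, rather than only the node counts and means shown by rpart.plot, a quick sketch (node is just a local name):

# rpartTree$where gives the terminal node that each training sample falls into
node <- factor(rpartTree$where)

# yield distribution by terminal node for the training data
boxplot(Chemical[index, "Yield"] ~ node,
        xlab = "Terminal node", ylab = "Yield",
        main = "Yield distribution in the terminal nodes")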