library(mlbench)
set.seed(200)
library(caret)
library(AppliedPredictiveModeling)
library(randomForest)
library(dplyr)
library(party)
library(gbm)
library(rpart)
library(rpart.plot)
library(ggplot2)
Recreate the simulated data from Exercise 7.2:
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
a) Fit a random forest model to all of the predictors, then estimate the variable importance scores: Did the random forest model significantly use the uninformative predictors (V6 – V10)?
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
## Overall
## V1 8.732235404
## V2 6.415369387
## V3 0.763591825
## V4 7.615118809
## V5 2.023524577
## V6 0.165111172
## V7 -0.005961659
## V8 -0.166362581
## V9 -0.095292651
## V10 -0.074944788
Variable importance falls off steeply after the informative predictors. The importance scores for the uninformative predictors (V6–V10) are close to zero, while the informative predictors carry substantial scores, so the random forest did not make significant use of V6–V10.
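For a quick visual confirmation of this drop-off, we could also use randomForest's built-in importance plot (a small optional check using the model fit above):
varImpPlot(model1, type = 1)  # type = 1 shows the permutation (%IncMSE) importance only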
b) Now add an additional predictor that is highly correlated with one of the informative predictors. Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
simulated_plus1predictor <- simulated
simulated_plus1predictor$duplicate1 <- simulated_plus1predictor$V1 + rnorm(200) * 0.1
cor(simulated_plus1predictor$duplicate1, simulated_plus1predictor$V1)
## [1] 0.9460206
model_rf_plus1 <- randomForest(
y ~ ., data = simulated_plus1predictor,
importance = TRUE,
ntree = 1000
)
rfImp_plus1 <- varImp(model_rf_plus1, scale = FALSE)
rfImp_plus1_sorted <- rfImp_plus1 %>% arrange(desc(Overall))
rfImp_plus1_sorted
## Overall
## V4 7.04752238
## V2 6.06896061
## V1 5.69119973
## duplicate1 4.28331581
## V5 1.87238438
## V3 0.62970218
## V6 0.13569065
## V10 0.02894814
## V9 0.00840438
## V7 -0.01345645
## V8 -0.04370565
Adding a highly correlated predictor dropped the importance of V1 from 8.73 to 5.69, and V4 now takes the top spot. The more correlated copies we add, the more V1's importance is split across the set, so each individual score keeps dropping.
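A quick side-by-side of V1's raw importance before and after adding duplicate1, using the importance frames already computed above:
data.frame(before_duplicate = rfImp1["V1", "Overall"],
           after_duplicate  = rfImp_plus1["V1", "Overall"],
           row.names = "V1")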
simulated_plus2predictors <- simulated_plus1predictor
simulated_plus2predictors$duplicate2 <- simulated_plus2predictors$V1 + rnorm(200) * 0.01
cor(simulated_plus2predictors$duplicate2, simulated_plus2predictors$V1)
## [1] 0.999376
model_rf_plus2 <- randomForest(
y ~ ., data = simulated_plus2predictors,
importance = TRUE,
ntree = 1000
)
rfImp_plus2 <- varImp(model_rf_plus2, scale = FALSE)
rfImp_plus2_sorted <- rfImp_plus2 %>% arrange(desc(Overall))
rfImp_plus2_sorted
## Overall
## V4 7.09158234
## V2 6.78792845
## V1 4.18035786
## duplicate2 4.06559574
## duplicate1 2.95441997
## V5 1.99621666
## V3 0.64037145
## V6 0.15136139
## V10 0.06924451
## V7 0.02166139
## V9 -0.05078724
## V8 -0.08876124
V1's importance drops further, to 4.18, and is now diluted across duplicate1 and duplicate2. Random forest importance scores are clearly sensitive to multicollinearity among predictors.
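The degree of collinearity involved is easy to verify directly:
round(cor(simulated_plus2predictors[, c("V1", "duplicate1", "duplicate2")]), 3)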
c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model_cforest <- cforest(
y ~ ., data = simulated_plus2predictors,
controls = cforest_unbiased(ntree = 1000)
)
cforest_imp <- varimp(model_cforest)
cforest_imp_conditional <- varimp(model_cforest, conditional = TRUE)
Traditional (unconditional) importance:
sort(cforest_imp, decreasing = TRUE)
## V4 V2 duplicate1 V1 duplicate2 V5
## 7.037023814 5.842185785 3.953648578 3.813500996 2.423641256 1.568077773
## V3 V7 V6 V9 V8 V10
## 0.023476911 0.013065210 0.002165501 -0.014321460 -0.022004251 -0.040804439
Conditional importance:
sort(cforest_imp_conditional, decreasing = TRUE)
## V4 V2 duplicate1 V1 V5 duplicate2
## 5.587500324 4.573214880 1.357034540 1.335761172 1.097710024 1.056913796
## V6 V7 V9 V3 V8 V10
## 0.025747278 0.009038659 0.004086155 0.001745259 0.001492007 -0.015866157
The overall pattern is similar, with V4, V2, and V1 near the top and V6–V10 near zero, but the handling of the correlated predictors differs. In the random forest, both duplicates show relatively high importance, with duplicate2 higher than duplicate1. With the traditional cforest importance, duplicate1 is more important than duplicate2, and both are lower than in the random forest. With conditional = TRUE, the importance of both duplicates drops substantially, reflecting the method's adjustment for correlation between predictors.
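A side-by-side view makes the contrast between the two cforest measures easier to read (just a reshaping of the two vectors computed above):
imp_compare <- data.frame(traditional = cforest_imp,
                          conditional = cforest_imp_conditional[names(cforest_imp)])
imp_compare[order(imp_compare$traditional, decreasing = TRUE), ]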
d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted
model_boosted1 <- train(
y ~ ., data = simulated,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp1 <- varImp(model_boosted1)$importance
boosted_imp1 <- boosted_imp1[order(boosted_imp1$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp1)
## Overall
## V4 100.0000000
## V1 87.8605104
## V2 72.4122436
## V5 46.5770853
## V3 26.6947227
## V6 1.2998739
## V7 1.2198369
## V9 0.4203124
## V8 0.3256926
## V10 0.0000000
model_boosted2 <- train(
y ~ ., data = simulated_plus1predictor,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp2 <- varImp(model_boosted2)$importance
boosted_imp2 <- boosted_imp2[order(boosted_imp2$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp2)
## Overall
## V4 100.00000000
## V2 72.95352598
## duplicate1 65.02855908
## V1 32.68979890
## V5 32.37670927
## V3 23.35722711
## V9 5.40107326
## V7 1.82152089
## V6 0.74658255
## V8 0.01413698
## V10 0.00000000
model_boosted3 <- train(
y ~ ., data = simulated_plus2predictors,
method = "gbm",
verbose = FALSE,
trControl = trainControl(method = "cv")
)
boosted_imp3 <- varImp(model_boosted3)$importance
boosted_imp3 <- boosted_imp3[order(boosted_imp3$Overall, decreasing = TRUE), , drop = FALSE]
print(boosted_imp3)
## Overall
## V4 100.00000000
## V2 73.89107909
## V1 59.89924903
## V5 37.35815544
## V3 28.17178089
## duplicate1 21.00899753
## duplicate2 11.50615917
## V9 1.50397970
## V7 1.38884765
## V6 0.77833787
## V8 0.01272447
## V10 0.00000000
As with the other models, adding highly correlated predictors reduces the importance of V1 and pushes its duplicates up the rankings. Even so, V1 holds up relatively well here: it drops only one place, to just below V2, once the correlated duplicates are added.
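To see the dilution numerically, we can track V1's importance across the three boosted fits (note that varImp rescales scores to 0–100 within each model, so these are relative rather than raw values):
c(no_duplicates  = boosted_imp1["V1", "Overall"],
  one_duplicate  = boosted_imp2["V1", "Overall"],
  two_duplicates = boosted_imp3["V1", "Overall"])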
model_cubist1 <- train(
y ~ ., data = simulated,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp1 <- varImp(model_cubist1)$importance
cubist_imp1 <- cubist_imp1[order(cubist_imp1$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp1)
## Overall
## V1 100.00000
## V2 77.30496
## V4 70.92199
## V5 67.37589
## V3 58.15603
## V6 19.85816
## V7 0.00000
## V8 0.00000
## V9 0.00000
## V10 0.00000
model_cubist2 <- train(
y ~ ., data = simulated_plus1predictor,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp2 <- varImp(model_cubist2)$importance
cubist_imp2 <- cubist_imp2[order(cubist_imp2$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp2)
## Overall
## V2 100.00000
## V1 89.51613
## V4 80.64516
## V3 67.74194
## duplicate1 59.67742
## V5 50.00000
## V6 25.00000
## V7 0.00000
## V8 0.00000
## V9 0.00000
## V10 0.00000
model_cubist3 <- train(
y ~ ., data = simulated_plus2predictors,
method = "cubist",
trControl = trainControl(method = "cv")
)
cubist_imp3 <- varImp(model_cubist3)$importance
cubist_imp3 <- cubist_imp3[order(cubist_imp3$Overall, decreasing = TRUE), , drop = FALSE]
print(cubist_imp3)
## Overall
## V2 100.0000000
## V1 79.7101449
## V4 77.5362319
## duplicate2 70.2898551
## V3 58.6956522
## V5 47.8260870
## duplicate1 34.7826087
## V6 17.3913043
## V8 7.2463768
## V10 0.7246377
## V7 0.0000000
## V9 0.0000000
Boosted trees show the largest drop in V1's importance once multiple duplicates are added, more than random forest or cforest, while the top predictor V4 stays stable; in Cubist, by contrast, V1 sometimes leads. The duplicates displace V1 less dramatically in Cubist, and several low-importance predictors stay near zero rather than picking up spread-out importance. Boosting also redistributes importance fairly evenly between the two duplicates, whereas Cubist gives one duplicate a much higher score than the other.
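The different treatment of the two duplicates is also visible if we pull just those rows out of the two scaled importance frames above:
vars <- c("V1", "duplicate1", "duplicate2")
data.frame(boosted = boosted_imp3[vars, "Overall"],
           cubist  = cubist_imp3[vars, "Overall"],
           row.names = vars)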
Use a simulation to show tree bias with different granularities.
Let's first simulate a data set with five predictors that vary in granularity: categorical predictors with different numbers of levels and continuous predictors with different variances.
n <- 200
v1 <- factor(sample(c("A", "B"), n, replace = TRUE))
v2 <- factor(sample(letters[1:3], n, replace = TRUE))
v3 <- factor(sample(letters[1:5], n, replace = TRUE))
v4 <- rnorm(n, mean = 5, sd = 0.5)
v5 <- rnorm(n, mean = 5, sd = 2)
y <- rnorm(n, mean = as.numeric(v1) + as.numeric(v2) + as.numeric(v3) + v4 + v5, sd = 1)
sim_gran_cont <- data.frame(y, v1, v2, v3, v4, v5)
We can then fit a regression tree, plot it, and compare its first split and variable importance with each predictor's granularity:
tree_model_cont <- rpart(y ~ ., data = sim_gran_cont, method = "anova")
rpart.plot(tree_model_cont, type = 2, extra = 101, fallen.leaves = TRUE)
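For a programmatic check of the root split and rpart's own importance scores (using the tree object just fit):
as.character(tree_model_cont$frame$var[1])   # predictor used for the first split
tree_model_cont$variable.importance          # rpart's variable importance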
We see that the predictor with the largest number of possible distinct values, V5, gets chosen for the first split. We can estimate how often this occurs by repeating the simulation:
n_sim <- 100
first_split <- character(n_sim)
for (i in 1:n_sim) {
  v1 <- factor(sample(c("A", "B"), n, replace = TRUE))
  v2 <- factor(sample(letters[1:3], n, replace = TRUE))
  v3 <- factor(sample(letters[1:5], n, replace = TRUE))
  v4 <- rnorm(n, mean = 5, sd = 0.5)
  v5 <- rnorm(n, mean = 5, sd = 2)
  y <- rnorm(n, mean = as.numeric(v1) + as.numeric(v2) + as.numeric(v3) + v4 + v5, sd = 1)
  sim_data <- data.frame(y, v1, v2, v3, v4, v5)
  tree_model <- rpart(y ~ ., data = sim_data, method = "anova")
  first_split[i] <- as.character(tree_model$frame$var[1])
}
first_split_table <- table(first_split)
first_split_table
## first_split
## v3 v5
## 2 98
Across 100 simulations, V5 is chosen for the first split almost every time. Being a continuous variable with a large variance, V5 offers many candidate split points and therefore far more opportunities to reduce the RSS at the root, so it consistently wins out over predictors with fewer distinct values. This illustrates the selection bias of CART-style trees toward higher-granularity predictors.
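The granularity difference driving this bias is easy to quantify by counting the distinct values of each predictor in the simulated data:
sapply(sim_gran_cont[, -1], function(x) length(unique(x)))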
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The high learning rate and bagging fraction for the model on the right cause each boosting iteration to make large updates using nearly the entire data set (90 percent). The model therefore latches onto the strongest predictors quickly and fits most of the residual variation with them, leaving little for the weaker predictors to explain. On the left, the smaller values mean each tree is trained on a small subsample and captures only a small portion of the residual structure, so importance spreads across more predictors as many of them contribute incrementally over a large number of small updates.
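As a rough sketch of this effect (not the exact configuration behind Fig. 8.24, and with simplified tree settings), we could fit the two extremes directly on the solubility data that ships with AppliedPredictiveModeling; gbm_left and gbm_right below are illustrative names:
data(solubility)
solTrain <- cbind(solTrainXtrans, Solubility = solTrainY)
gbm_left  <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1)
gbm_right <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                 n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_left,  plotit = FALSE), 10)   # importance spread across many predictors
head(summary(gbm_right, plotit = FALSE), 10)   # importance concentrated in a few predictors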
(b)
Which model do you think would be more predictive of other samples?
The model on the left would probably be more predictive of other samples. The smaller learning rate and smaller subsamples force it to learn the structure slowly and more generally: it does not hone in on the strongest predictors right away, and it lets a wider range of variables contribute through incremental updates, which usually yields a model that generalizes better. The model on the right is much more aggressive; while it may fit the training data faster, it is more likely to overfit and perform worse on new samples.
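Continuing the hypothetical sketch above, one way to check this expectation is to compare the two fits on the held-out solubility test set (the outcome is not guaranteed, but the left-hand settings would typically fare better after enough trees):
pred_left  <- predict(gbm_left,  solTestXtrans, n.trees = 100)
pred_right <- predict(gbm_right, solTestXtrans, n.trees = 100)
RMSE(pred_left,  solTestY)
RMSE(pred_right, solTestY)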
(c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?
Increasing the interaction depth lets each tree capture more complexity, since more splits and nodes are allowed. Deeper trees bring additional predictors into play below the first split, so importance gets spread across more variables and the weaker predictors contribute more. As a result, the slope of the predictor importance profile in Fig. 8.24 would flatten for both models, because the fall-off from the top predictors becomes much less drastic.
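Again as a sketch built on the hypothetical objects above, refitting the aggressive settings with a deeper interaction depth and comparing its importance profile to gbm_right's would show the flattening directly:
gbm_deep <- gbm(Solubility ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                interaction.depth = 10)
head(summary(gbm_deep, plotit = FALSE), 10)   # compare against summary(gbm_right) above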
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
trans <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
chemicalData <- predict(trans, ChemicalManufacturingProcess)
trainIndex <- createDataPartition(chemicalData$Yield, p = 0.8, list = FALSE)
chemTrain <- chemicalData[trainIndex, ]
chemTest <- chemicalData[-trainIndex, ]
trainX <- chemTrain[, -which(names(chemTrain) == "Yield")]
trainY <- chemTrain$Yield
testX <- chemTest[, -which(names(chemTest) == "Yield")]
testY <- chemTest$Yield
(a)
Which tree-based regression model gives the optimal resampling and test set performance?
We can try all of the tree-based models used throughout this assignment:
treeModel_chem <- rpart(Yield ~ ., data = chemTrain, method = "anova")
rfModel_chem <- randomForest(Yield ~ ., data = chemTrain, importance = TRUE, ntree = 1000)
cforestModel_chem <- cforest(Yield ~ ., data = chemTrain, controls = cforest_unbiased(ntree=1000))
boostedModel_chem <- train(Yield ~ ., data = chemTrain, method = "gbm", verbose = FALSE, trControl = trainControl(method = "cv"))
cubistModel_chem <- train(Yield ~ ., data = chemTrain, method = "cubist", trControl = trainControl(method = "cv"))
evaluate <- function(model, testX, testY, type = c("caret", "base", "cforest")) {
  # Computes test-set RMSE and R-squared; cforest objects need OOB = FALSE and
  # type = "response" to return plain numeric predictions on new data.
  type <- match.arg(type)
  preds <- switch(type,
                  caret   = predict(model, newdata = testX),
                  base    = predict(model, newdata = testX),
                  cforest = as.numeric(predict(model, newdata = testX,
                                               OOB = FALSE, type = "response")))
  data.frame(RMSE = RMSE(preds, testY), Rsquared = R2(preds, testY))
}
perf_list <- list(
RegressionTree = evaluate(treeModel_chem, testX, testY, "base"),
RandomForest = evaluate(rfModel_chem, testX, testY, "base"),
CForest = evaluate(cforestModel_chem, testX, testY, "cforest"),
Boosted = evaluate(boostedModel_chem, testX, testY, "caret"),
Cubist = evaluate(cubistModel_chem, testX, testY, "caret")
)
perf_table <- do.call(rbind, perf_list)
perf_table
## RMSE Rsquared
## RegressionTree 0.7755147 0.4964955
## RandomForest 0.6177840 0.6342400
## CForest 0.6917245 0.5699876
## Boosted 0.5904234 0.6509749
## Cubist 0.5016777 0.7575507
Cubist has the best test-set performance, with the lowest RMSE (0.50) and the highest R-squared (0.76), followed by boosted trees and random forest; the single regression tree performs the worst.
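The question also asks about resampling performance. For the two models tuned with caret we can compare their cross-validation results directly (a sketch using the train objects above; the other three models were fit outside caret, so they are not included):
resamps <- resamples(list(Boosted = boostedModel_chem, Cubist = cubistModel_chem))
summary(resamps)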
(b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
cubist_imp <- varImp(cubistModel_chem)$importance
cubist_imp_sorted <- cubist_imp[order(cubist_imp$Overall, decreasing = TRUE), , drop = FALSE]
print(head(cubist_imp_sorted,10))
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess29 65.04854
## ManufacturingProcess13 46.60194
## BiologicalMaterial02 40.77670
## ManufacturingProcess27 36.89320
## ManufacturingProcess04 33.98058
## BiologicalMaterial03 31.06796
## ManufacturingProcess33 27.18447
## ManufacturingProcess17 24.27184
## ManufacturingProcess31 23.30097
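As a quick check of the biological vs. process split in the Cubist top 10 (using the sorted importance frame above):
top10 <- head(rownames(cubist_imp_sorted), 10)
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))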
Once again the process variables dominate the list, 8 to 2. Now we can compare this with our optimal linear and nonlinear models from the prior homework assignments.
Our best-performing linear model (lasso) had the following variable importance order:
Our best-performing nonlinear model (SVM) had the following order:
The Cubist model emphasizes process variables even more strongly than the previous models, going 8:2 compared with the linear lasso's 6:4 and the nonlinear SVM's 7:3. It also introduces several process predictors into the top ten that were not top-ranked in the lasso or SVM models, displacing biological predictors. ManufacturingProcess32 and ManufacturingProcess13 retain top-five spots across all three models, showing their strong influence on yield.
(c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
treeModel_chem <- rpart(Yield ~ ., data = chemTrain, method = "anova")
rpart.plot(treeModel_chem, type = 3, extra = 101, fallen.leaves = TRUE)
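The rpart.plot call above shows node means and counts but not the full yield distribution in each terminal node. One way to get the distribution view the question asks for is to convert the rpart fit with the partykit package (not loaded above, so this is a sketch assuming it is installed):
library(partykit)               # used only for as.party(); may mask some party functions
plot(as.party(treeModel_chem))  # terminal nodes are drawn with boxplots of Yield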
We see the first split on ManufacturingProcess32, as expected. The tree then shows how yield changes in subsequent splits on BiologicalMaterial11 and ManufacturingProcess13. Beyond the second level, a biological variable appears in only a small portion of the tree, while the remaining splits on both sides are dominated by manufacturing process variables. This visualization helps us understand how individual predictors affect yield, how broad their influence is, and which side of the yield distribution they push observations toward.