library(mlbench)
library(randomForest)
library(caret)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(AppliedPredictiveModeling)
library(rpart.plot)
library(partykit)

8.1

set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##          Overall
## V1   8.732235404
## V2   6.415369387
## V3   0.763591825
## V4   7.615118809
## V5   2.023524577
## V6   0.165111172
## V7  -0.005961659
## V8  -0.166362581
## V9  -0.095292651
## V10 -0.074944788

The uninformative predictors (V6–V10) have importance scores near zero or even slightly negative, which means the random forest essentially ignored them. The informative predictors (V1–V5) all score meaningfully higher, with V1, V4, and V2 standing out the most. So no, the random forest did not significantly use the uninformative predictors.
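
For intuition about what these scores measure, the permutation idea can be sketched by hand: shuffle one predictor, re-predict, and see how much the error rises. This is only a rough illustration using in-sample predictions (randomForest's own measure permutes within the out-of-bag samples), so the numbers will not match the table above; permImp is a hypothetical helper, not part of any package.

#Sketch: hand-rolled permutation importance (illustrative only)
permImp <- function(model, data, var) {
  baseline <- mean((predict(model, data) - data$y)^2)
  shuffled <- data
  shuffled[[var]] <- sample(shuffled[[var]])
  #Increase in MSE after shuffling = that predictor's importance
  mean((predict(model, shuffled) - shuffled$y)^2) - baseline
}
set.seed(200)
sapply(c("V1", "V6"), function(v) permImp(model1, simulated, v))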

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
print("Duplicate1 corr:")
## [1] "Duplicate1 corr:"
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9460206
model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                Overall
## V1          5.69119973
## V2          6.06896061
## V3          0.62970218
## V4          7.04752238
## V5          1.87238438
## V6          0.13569065
## V7         -0.01345645
## V8         -0.04370565
## V9          0.00840438
## V10         0.02894814
## duplicate1  4.28331581
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
print("Duplicate2 corr:")
## [1] "Duplicate2 corr:"
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9408631
model3 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3
##                Overall
## V1          4.91687329
## V2          6.52816504
## V3          0.58711552
## V4          7.04870917
## V5          2.03115561
## V6          0.14213148
## V7          0.10991985
## V8         -0.08405687
## V9         -0.01075028
## V10         0.09230576
## duplicate1  3.80068234
## duplicate2  1.87721959

Yes, V1’s importance dropped from 8.73 in model1 to 5.69 in model2 after adding duplicate1 (r = 0.946), with the importance now split between V1 (5.69) and duplicate1 (4.28). Adding a second correlated predictor pushed V1 down further to 4.92 in model3, with importance now spread across V1 (4.92), duplicate1 (3.80), and duplicate2 (1.88). Each time a correlated predictor is added, the forest has more candidates to choose from at each split, so the importance of the original predictor gets diluted. This is a known limitation of the traditional random forest importance measure: it is not reliable when predictors are highly correlated and can understate the importance of the true predictor.
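
One quick way to see the dilution is to compare V1’s solo importance in model1 with the combined score of V1 and its copies in model3; a minimal check assuming the rfImp1 and rfImp3 objects above:

#Sketch: solo importance vs. the combined V1 + duplicates share
rfImp1["V1", "Overall"]
sum(rfImp3[c("V1", "duplicate1", "duplicate2"), "Overall"])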

#Remove duplicates first
simulated_orig <- simulated[, !names(simulated) %in% c("duplicate1", "duplicate2")]

#Fit cforest
set.seed(200)
model_cforest <- cforest(y ~ ., data = simulated_orig)

#Traditional importance
imp_traditional <- varimp(model_cforest, conditional = FALSE)
imp_traditional
##          V1          V2          V3          V4          V5          V6 
##  8.04166864  6.21307352  0.23458171  7.55930190  2.13338717  0.07895189 
##          V7          V8          V9         V10 
##  0.21504772 -0.30008835 -0.04506062 -0.24685120
#Conditional importance
imp_conditional <- varimp(model_cforest, conditional = TRUE)
imp_conditional
##         V1         V2         V3         V4         V5         V6         V7 
##  5.8454101  5.0272517  0.2313670  6.2533209  1.4521394 -0.2014898 -0.2234096 
##         V8         V9        V10 
## -0.2420770 -0.2634202 -0.1054899

The traditional cforest importance shows a very similar pattern to the original random forest: V1–V5 rank highest and V6–V10 are all near zero, so there is no real difference there. For conditional importance, the scores are lower overall, which is expected since conditional importance accounts for correlations between predictors. The ordering from highest to lowest remains similar, and the uninformative predictors stay near zero. The conditional measure is more conservative but tells the same story here, and since there aren’t highly correlated predictors in this version of the data, the two measures end up agreeing. The difference would be more obvious if this were repeated with the duplicates included, where the conditional importance should do a better job of attributing importance to the true predictor rather than splitting it across correlated ones.
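
That follow-up could be run directly; a sketch assuming the full simulated data frame (still holding both duplicates) from above (note that conditional = TRUE is noticeably slow):

#Sketch: conditional importance with the correlated duplicates included
set.seed(200)
model_cforest_dup <- cforest(y ~ ., data = simulated)
varimp(model_cforest_dup, conditional = TRUE)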

#Boosted trees
set.seed(200)
gbmModel <- train(y ~ ., data = simulated_orig, method = "gbm", trControl = trainControl(method = "cv", number = 10), verbose = FALSE)
gbmImp <- varImp(gbmModel, scale = FALSE)
gbmImp
## gbm variable importance
## 
##     Overall
## V4  4657.39
## V1  3681.79
## V2  3384.53
## V5  1822.31
## V3  1194.51
## V7   179.01
## V6   167.85
## V9   141.95
## V8    82.39
## V10   80.25
#Cubist
set.seed(200)
cubistModel <- train(y ~ ., data = simulated_orig, method = "cubist", trControl = trainControl(method = "cv", number = 10))

cubistImp <- varImp(cubistModel, scale = FALSE)
cubistImp
## cubist variable importance
## 
##     Overall
## V1     70.5
## V2     54.5
## V4     50.0
## V5     47.5
## V3     41.0
## V7      0.0
## V6      0.0
## V10     0.0
## V8      0.0
## V9      0.0

Both models show the same pattern as the random forest: V1–V5 are clearly the important predictors and V6–V10 score near or at zero. Cubist is especially clean, with V6–V10 scoring exactly zero. The ranking shifts slightly between models, but ultimately all the tree-based methods successfully ignore the uninformative predictors. This is consistent with what we saw from the conditional inference forest.
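
For a compact comparison, the three sets of scores can be lined up by predictor; a sketch assuming the rfImp1, gbmImp, and cubistImp objects above (varImp on a train object stores its scores in $importance):

#Sketch: importance scores side by side across the three models
vnames <- paste0("V", 1:10)
data.frame(rf = rfImp1[vnames, "Overall"],
           gbm = gbmImp$importance[vnames, "Overall"],
           cubist = cubistImp$importance[vnames, "Overall"],
           row.names = vnames)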

8.2

set.seed(200)
n <- 200

#Predictors with different granularities, none related to y
x_continuous <- runif(n, 0, 1) #200 unique values
x_few <- sample(1:4, n, replace = TRUE) #4 unique values
x_binary <- sample(0:1, n, replace = TRUE) #2 unique values

#y is purely random, no predictor is truly informative
y <- rnorm(n)
df <- data.frame(y, x_continuous, x_few, x_binary)

tree <- rpart(y ~ ., data = df)
varImp(tree)
##                Overall
## x_binary     0.0970479
## x_continuous 0.4530006
## x_few        0.2061587

I simulated three predictors with no true relationship to a random response: a continuous predictor (200 unique values), a 4-level predictor, and a binary predictor. Since none are truly informative, any importance assigned by the tree is purely due to bias. The results follow the order of granularity exactly: x_continuous (0.453), x_few (0.206), x_binary (0.097). This shows that trees favor predictors with more unique values simply because they offer more potential split points.
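
As a contrast, a conditional inference tree can be fit to the same data (party is already loaded above). Its p-value-based split selection is designed to resist this granularity bias, so with pure noise it would be expected to make no splits at all:

#Sketch: unbiased split selection on the same noise-only data
ctree_noise <- ctree(y ~ ., data = df)
ctree_noise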

8.3

  1. The right model (0.9) takes large steps and sees almost all the data at each iteration, so the strongest predictors get picked repeatedly and end up with much higher importance scores; because each iteration moves farther in the direction of the steepest gradient, the strongest predictors absorb most of the improvement. The importance scale on the right plot exceeds 1000, compared to around 600 on the left, which shows how concentrated the importance is. At 0.1, smaller steps and random subsampling mean that weaker predictors get more chances to contribute, spreading importance more evenly across all predictors shown (see the sketch after this list).

  2. I would expect the left model (0.1) to be the more predictive one on other samples. The right model relies so heavily on only a couple of predictors that it may be missing useful information from the others; a more balanced model seems preferable.

  3. Increasing the interaction depth would affect the two models differently. For the left model, deeper trees would probably spread the importance even more evenly, since the smaller steps and subsampling prevent any one predictor from dominating. For the right model, the higher learning rate means deeper trees would just exacerbate the drop-off between the strongest predictors and the rest. So in essence, increasing the interaction depth would amplify the tendencies we have already observed in each model.
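
The two settings can be imitated on the simulation data from 8.1 rather than the solubility data in the figure; a rough sketch, with the expected patterns noted as comments rather than asserted:

#Sketch: low vs. high shrinkage and bagging fraction (illustrative)
set.seed(200)
gbmLow <- gbm(y ~ ., data = simulated_orig, distribution = "gaussian",
              n.trees = 1000, shrinkage = 0.1, bag.fraction = 0.1)
set.seed(200)
gbmHigh <- gbm(y ~ ., data = simulated_orig, distribution = "gaussian",
               n.trees = 1000, shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLow, plotit = FALSE)  #expected: importance spread more evenly
summary(gbmHigh, plotit = FALSE) #expected: importance concentrated at the top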

8.7

data(ChemicalManufacturingProcess)

preProcValues <- preProcess(ChemicalManufacturingProcess[, -1], method = "knnImpute")
imputedPredictors <- predict(preProcValues, ChemicalManufacturingProcess[, -1])
yield <- ChemicalManufacturingProcess$Yield

set.seed(33)
trainingRows <- createDataPartition(yield, p = .80, list = FALSE)
trainX <- imputedPredictors[trainingRows, ]
trainY <- yield[trainingRows]
testX <- imputedPredictors[-trainingRows, ]
testY <- yield[-trainingRows]

ctrl <- trainControl(method = "cv", number = 10)
#Random Forest
set.seed(33)
rfModel87 <- train(x = trainX, y = trainY, method = "rf", trControl = ctrl, importance = TRUE, ntree = 1000)
#Boosted Trees
set.seed(33)
gbmModel87 <- train(x = trainX, y = trainY, method = "gbm", trControl = ctrl, verbose = FALSE)
#Cubist
set.seed(33)
cubistModel87 <- train(x = trainX, y = trainY, method = "cubist", trControl = ctrl)
#Resampling comparison
results87 <- resamples(list(RF = rfModel87, GBM = gbmModel87, Cubist = cubistModel87))
print("Resampling performance")
## [1] "Resampling performance"
summary(results87)
## 
## Call:
## summary.resamples(object = results87)
## 
## Models: RF, GBM, Cubist 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RF     0.6455615 0.7022419 0.9401330 0.9019678 1.0565499 1.236606    0
## GBM    0.6093052 0.8165041 0.8885082 0.8844033 0.9259963 1.238539    0
## Cubist 0.5590419 0.6326796 0.7592646 0.7719877 0.8066334 1.164862    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RF     0.7845847 0.8349505 1.1248046 1.1245102 1.3617565 1.535984    0
## GBM    0.7787470 1.0203021 1.0917912 1.1051583 1.1828297 1.444436    0
## Cubist 0.6707862 0.8511431 0.9049166 0.9754323 0.9488943 1.621567    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF     0.5081443 0.5828635 0.6582416 0.6718742 0.7774701 0.8127047    0
## GBM    0.5887260 0.6215905 0.6411545 0.6617221 0.6844222 0.7826017    0
## Cubist 0.4556878 0.7148040 0.7279097 0.7357621 0.8081979 0.8744603    0
#Test set performance
print("Test set performance")
## [1] "Test set performance"
print("Random Forest:")
## [1] "Random Forest:"
postResample(pred = predict(rfModel87, testX), obs = testY)
##      RMSE  Rsquared       MAE 
## 1.0607903 0.6953833 0.8001382
print("GBM:")
## [1] "GBM:"
postResample(pred = predict(gbmModel87, testX), obs = testY)
##      RMSE  Rsquared       MAE 
## 1.0675323 0.6932065 0.7760023
print("Cubist:")
## [1] "Cubist:"
postResample(pred = predict(cubistModel87, testX), obs = testY)
##      RMSE  Rsquared       MAE 
## 0.8238009 0.8449741 0.6499340

Based on these results, Cubist gives the best overall performance on both the resampling and the test set. It had the lowest mean RMSE (0.975) in the resampling, and on the test set it also had the lowest RMSE (0.824) as well as the highest R-squared (0.845).
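
The resampling distributions can also be compared visually with the lattice methods that caret provides for resamples objects:

#Sketch: visual comparison of the cross-validated RMSE distributions
bwplot(results87, metric = "RMSE")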

#Variable importance of best model (Cubist)
impCubist87 <- varImp(cubistModel87, scale = FALSE)
plot(impCubist87, top = 20)

top10_tree <- rownames(impCubist87$importance)[order(impCubist87$importance[,1], decreasing = TRUE)][1:10]
print(top10_tree)
##  [1] "ManufacturingProcess32" "ManufacturingProcess33" "ManufacturingProcess17"
##  [4] "BiologicalMaterial03"   "ManufacturingProcess09" "ManufacturingProcess28"
##  [7] "ManufacturingProcess27" "ManufacturingProcess29" "BiologicalMaterial12"  
## [10] "ManufacturingProcess13"
#Compare to linear (PLS) and nonlinear (SVM) top 10
top10_pls <- c("ManufacturingProcess32", "ManufacturingProcess13",
               "ManufacturingProcess09", "ManufacturingProcess36",
               "ManufacturingProcess17", "BiologicalMaterial02",
               "ManufacturingProcess11", "BiologicalMaterial06",
               "ManufacturingProcess06", "ManufacturingProcess33")

top10_svm <- c("ManufacturingProcess32", "ManufacturingProcess13",
               "ManufacturingProcess17", "BiologicalMaterial06",
               "BiologicalMaterial03", "BiologicalMaterial02",
               "ManufacturingProcess36", "ManufacturingProcess09",
               "BiologicalMaterial12", "ManufacturingProcess31")

unique_to_tree <- setdiff(top10_tree, union(top10_pls, top10_svm))
cat("Predictors unique to tree model top 10:\n")
## Predictors unique to tree model top 10:
print(unique_to_tree)
## [1] "ManufacturingProcess28" "ManufacturingProcess27" "ManufacturingProcess29"

ManufacturingProcess32 is the most important predictor by a large margin, which has been consistent across all three models from the previous chapters. Overall, process variables dominate the top ten, with eight of the ten being manufacturing process predictors. Seven predictors are shared with either the PLS or SVM top tens, and the three unique to Cubist are ManufacturingProcess27, ManufacturingProcess28, and ManufacturingProcess29. Once again, the dominance of process variables over biological predictors is consistent across all models.
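
The overlap counts quoted above can be verified directly from the vectors already defined:

#Sketch: overlap of the Cubist top 10 with the earlier models
length(intersect(top10_tree, union(top10_pls, top10_svm))) #shared with PLS or SVM
length(intersect(top10_tree, top10_pls))
length(intersect(top10_tree, top10_svm))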

#Plot the optimal single tree with yield distribution in terminal nodes
set.seed(33)
singleTree87 <- rpart(y ~ ., data = cbind(trainX, y = trainY))

#Plot with terminal node distributions
singleTree87_party <- as.party(singleTree87)
plot(singleTree87_party)

The tree splits first on ManufacturingProcess32, which is consistent with its top ranking across all models. Batches below 0.192 (on the centered and scaled predictor scale) tend toward lower yield, while those at or above trend higher, a pattern clearly visible in the upward trend of the terminal-node boxplots. Interestingly, biological predictors like BiologicalMaterial11 and BiologicalMaterial05 appear in the left branch, suggesting that when ManufacturingProcess32 is low, raw material quality becomes more relevant; the right branch is dominated entirely by process variables. Overall, the single tree tells the same story: ManufacturingProcess32 affects yield most strongly, process variables are the most important overall, and biological predictors can still provide some signal under certain conditions.
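
As an optional alternative rendering, rpart.plot (loaded above but not yet used) draws the fitted yield in each node instead of boxplots, which can be easier to read when the tree is small:

#Sketch: alternative plot of the same tree
rpart.plot(singleTree87)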