DATA 624: Assignment 09

Author

Curtis Elsasser

Published

April 27, 2025

Setup

library(caret)
library(corrplot)
library(mlbench)
library(party)
library(randomForest)
library(rJava)
library(RWeka)
library(rpart)
library(rpart.plot)
library(tidyverse)

set.seed(314159)

Assignment

Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.

M5 vs. Cubist

I had trouble with the RWeka package. It seems that on Macs the full JDK needs to be installed. I like Java from a distance, but locally it's orthogonal to the way applications work in an operating system. So I opted for another regression model, Cubist, which I think is a good substitute for M5. It is a rule-based regression model that fits linear models within its rules to make predictions.

Update: I have since installed the full JDK and have been able to run M5, so I will include it in the results. It did not perform as well as I expected; it only outperformed CART, and only on the training set. Cubist is still our champion, and he's not a tree-based model. We shall tuck him in and wake him for next week's assignment.
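For reference, here is a minimal standalone Cubist sketch, using the Cubist package directly rather than through caret and with mtcars only as a stand-in dataset (not part of the assignment data); summary() prints the rules and the linear model attached to each rule.

library(Cubist)

# illustration only: mtcars is just a stand-in dataset
fit <- cubist(x = mtcars[, -1], y = mtcars$mpg, committees = 1)
# each rule is a set of conditions paired with its own linear regression model
summary(fit)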

Problem 8.1

Recreate the simulated data from Exercise 7.2:

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
# My goodness, it took a while to understand how this works.
colnames(simulated)[ncol(simulated)] <- "y"

8.1.a

Fit a random forest model to all of the predictors, then estimate the variable importance scores:

model <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
imp1 <- varImp(model, scale = FALSE) |>
  arrange(desc(Overall))
imp1
        Overall
V4  12.06367073
V2   8.60313246
V1   8.49257256
V5   2.91084899
V3   1.06489047
V7   0.04315761
V8   0.02033607
V9   0.01368384
V10  0.00239764
V6  -0.20056409

Did the random forest model significantly use the uninformative predictors (V6–V10)?

I asked the Oracle what the units of varImp are, and it depends on the model. For random forest, caret reports the permutation-based measure (MDA, Mean Decrease in Accuracy), which is how much predictive accuracy is lost when a variable's values are permuted. The three most important variables are V4, V2 and V1; we can calculate exactly how far they outdistance the rest of the variables' importance.

s1 <- sum(imp1$Overall[1:3])
s2 <- sum(imp1$Overall[4:nrow(imp1)])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables =  0.883239359336319"

As can be seen, the three most important variables, none of which are uninformative predictors, collectively account for ~88% of the overall importance. The uninformative predictors (V6–V10) contribute essentially nothing, so no, the model did not significantly use them.
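To make the MDA idea concrete, here is a small hand-rolled sketch of permutation importance: permute one column, re-predict, and measure how much the error grows. It is an illustration of the concept only; randomForest actually permutes within the out-of-bag samples of each tree rather than re-predicting on the full data, so the numbers will not match varImp exactly.

# illustration of permutation importance (not randomForest's exact OOB calculation)
perm_importance <- function(fit, data, predictor, reps = 10) {
  base_mse <- mean((predict(fit, data) - data$y)^2)
  perm_mse <- replicate(reps, {
    shuffled <- data
    shuffled[[predictor]] <- sample(shuffled[[predictor]])
    mean((predict(fit, shuffled) - shuffled$y)^2)
  })
  # increase in MSE attributable to scrambling this predictor
  mean(perm_mse) - base_mse
}
perm_importance(model, simulated, "V4")  # informative: expect a large increase
perm_importance(model, simulated, "V6")  # uninformative: expect an increase near zero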

8.1.b

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simcpy <- simulated
simcpy$duplicate1 <- simcpy$V4 + rnorm(200) * .1
cor(simcpy$duplicate1, simcpy$V4)
[1] 0.9498567

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model <- randomForest(
  y ~ .,
  data = simcpy,
  importance = TRUE,
  ntree = 1000
)
imp2 <- varImp(model, scale = FALSE) |>
  arrange(desc(Overall))
imp2
               Overall
V2          8.20552963
V4          7.99952034
V1          6.95579791
duplicate1  6.09891479
V5          2.50339382
V3          1.12478131
V9          0.07797291
V8          0.01182104
V10        -0.07809073
V7         -0.08405116
V6         -0.11529284

Ah, very interesting. The importance of V4 (the variable that duplicate1 was built from) is now roughly two-thirds of what it was, dropping from about 12.06 to 8.00, and it has even slipped below V2. The duplicate itself scores about 6.10, and summing V4 and duplicate1 gives slightly more than V4's original score (reduction in accuracy): the correlated pair splits the credit. V1's score also drifted down a little (8.49 to 6.96), but the big change is in V4, the variable we actually duplicated. Nonetheless, the uninformative predictors are still not among the significant variables.
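The exercise also asks what happens when yet another highly correlated predictor is added. A sketch of that follow-up (not run here, and following my V4-based duplicate rather than the book's V1): the expectation is that the credit formerly belonging to V4 gets split three ways.

# add a second near-copy of V4 and refit; importance should dilute further
simcpy$duplicate2 <- simcpy$V4 + rnorm(200) * .1
model_dup2 <- randomForest(
  y ~ .,
  data = simcpy,
  importance = TRUE,
  ntree = 1000
)
varImp(model_dup2, scale = FALSE) |>
  arrange(desc(Overall))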

8.1.c

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

imp3 <- cforest(
  formula = y ~ .,
  data = simulated
) |>
  varimp() |>
  as.data.frame()

colnames(imp3) <- c("score")
imp3 <- arrange(imp3, desc(score))

The 3 most important variables are again V4, V2 and V1 (their scores appear in the comparison table at the end of 8.1.d), the same as with randomForest. And, looking at varimp's documentation, we see that it too reports a permutation-based mean decrease in accuracy. The distribution of the scores is a little different; nonetheless, I expect the ratio of the top 3 to the rest to be nearly the same. Let's see:

s1 <- sum(imp3[1:3, 1])
s2 <- sum(imp3[4:nrow(imp3), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables =  0.92566045035994"

Using party's random forest we get a slightly larger share of importance for the top three variables: roughly 93% versus 88%, an increase of about 4 percentage points.
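The exercise also points at varimp's conditional argument. Here is a sketch of how one might compare the traditional and conditional measures on the same fitted forest (conditional importance is considerably slower, so it is not run here):

cf <- cforest(y ~ ., data = simulated)
# traditional permutation importance
varimp(cf)
# conditional importance (Strobl et al., 2007): permutes within strata defined by
# correlated covariates, reducing the bias toward correlated predictors
varimp(cf, conditional = TRUE)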

8.1.d

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boost

imp4 <- varImp(train(
  y ~ .,
  data = simulated,
  method = "gbm",
  verbose = FALSE
))$importance |>
  arrange(desc(Overall))

print(imp4)
       Overall
V4  100.000000
V2   58.727094
V1   54.572149
V5   21.335167
V3    8.129586
V7    2.392881
V9    2.303282
V8    1.484307
V6    0.514422
V10   0.000000
s1 <- sum(imp4[1:3, 1])
s2 <- sum(imp4[4:nrow(imp4), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables =  0.855047675757029"

Boosting is slower to train, but the order of the first three variables is the same: V4, V2 and V1. The units differ here because caret's varImp scales the scores to a 0–100 range by default; still, if we look at the ratio of the sum of the first 3 scores to the sum of all of them, we find it is ~86%, slightly less than randomForest's percentage.

Cubist

imp5 <- varImp(train(
  y ~ .,
  data = simulated,
  method = "cubist"
))$importance |>
  arrange(desc(Overall))

print(imp5)
        Overall
V4  100.0000000
V2   57.7656579
V1   52.5984595
V5   20.7686059
V3    7.5742126
V7    2.1878799
V9    2.0147359
V8    1.3946398
V10   0.2677225
V6    0.0000000
s1 <- sum(imp5[1:3, 1])
s2 <- sum(imp5[4:nrow(imp5), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables =  0.860131950603612"

The order of the first three variables of importance is the same: V4, V2 and V1, and the ratio of importance is almost identical to the boosted model's. Let's compare their results. Actually, let's compare all 4 models' results:

data.frame(
  RFVariable = rownames(imp1),
  RFScore = imp1$Overall,
  cRFVariable = rownames(imp3),
  cRFScore = imp3$score,
  BoostVariable = rownames(imp4),
  BoostScore = imp4$Overall,
  CubistVariable = rownames(imp5),
  CubistScore = imp5$Overall
)
   RFVariable     RFScore cRFVariable    cRFScore BoostVariable BoostScore
1          V4 12.06367073          V4 13.72910219            V4 100.000000
2          V2  8.60313246          V2  8.84083652            V2  58.727094
3          V1  8.49257256          V1  7.85429210            V1  54.572149
4          V5  2.91084899          V5  2.29682903            V5  21.335167
5          V3  1.06489047          V3  0.15394511            V3   8.129586
6          V7  0.04315761          V9  0.09548485            V7   2.392881
7          V8  0.02033607          V8  0.02418386            V9   2.303282
8          V9  0.01368384          V7 -0.03629831            V8   1.484307
9         V10  0.00239764         V10 -0.03825558            V6   0.514422
10         V6 -0.20056409          V6 -0.05252690           V10   0.000000
   CubistVariable CubistScore
1              V4 100.0000000
2              V2  57.7656579
3              V1  52.5984595
4              V5  20.7686059
5              V3   7.5742126
6              V7   2.1878799
7              V9   2.0147359
8              V8   1.3946398
9             V10   0.2677225
10             V6   0.0000000

The order of the first 4 variables (V4, V2, V1, V5) is identical across all 4 models, and those 4 variables carry more than 90% of the importance in each case.

Problem 8.2

Use a simulation to show tree bias with different granularities.

A simulation? A simulation of our own making, with different granularities. That could mean varying granularity within a single dataset or across different versions of the same dataset. Let's vary it within a single dataset and see how it goes. First we shall create our simulated data: a dataframe with 5 predictors, each with a different distribution and granularity, which I take to mean the number of distinct values a predictor can take.

n <- 100
df <- data.frame(
  N1 = rnorm(n, mean = 0, sd = 1),
  N2 = rnorm(n, mean = 5, sd = 5),
  B1 = rbinom(n, size = 1, prob = 0.5),
  S1 = sample(1:10, n, replace = TRUE),
  S2 = sample(c("cat", "dog", "fish"), n, replace=TRUE)
) |>
  mutate(
    Y = ((N1 + N2 + B1 + S1) / 4) * rnorm(n, 1, sd = 0.2)
  )

head(df, n=10)
           N1          N2 B1 S1   S2         Y
1  -1.1706107  4.76404334  1  5  dog 2.0665263
2  -0.6374881  7.76551363  0  2  dog 2.0889133
3   1.1974370 -4.40449669  1  6  cat 1.0331775
4  -1.4399896  1.82297967  0  2 fish 0.5440164
5  -0.4549463  2.30127971  1  4 fish 1.7565751
6  -0.2126887  2.39962847  1  4 fish 1.4491595
7   0.1732326 -5.03996154  1  8 fish 1.3540085
8  -0.7320212  2.97503481  1  5 fish 1.5839733
9   0.8889294 -0.01300527  0  4 fish 1.2143430
10  1.2237585 -1.07088589  1  3 fish 1.0745994

Now we shall build our tree. Let’s start with basic tree regression and see how that goes.

result <- rpart(
  Y ~ .,
  data = df
)
varImp(result) |>
  arrange(desc(Overall)) |>
  print()
     Overall
N2 2.8371014
S1 1.2905754
S2 0.6750443
N1 0.5590290
B1 0.5368589
rpart.plot(result, main = "Regression Tree for Simulation")

Well, we see that our most granular variable, N2, is also our most important variable. But our second most granular variable, N1, is not our second most important variable; S1 is. Let's see what the correlations look like.

df |>
  select(-S2) |>
  cor() |>
  corrplot(method = "number")

Well, N2 and Y show the highest correlation. Perhaps we should go about this in a different way: let's generate a dataset in which every predictor covers the same range but with a different granularity, and then see which variable rises to the surface.

n <- 100
df <- data.frame(
  X1 = seq(1, 10, length.out = n),
  X2 = sample(seq(1, 10, length.out = 50), n, replace = TRUE),
  X3 = sample(seq(1, 10, length.out = 25), n, replace = TRUE),
  X4 = sample(seq(1, 10, length.out = 10), n, replace = TRUE),
  X5 = sample(seq(1, 10, length.out = 5), n, replace = TRUE)
) |>
  mutate(
    Y = ((X1 + X2 + X3 + X4 + X5) / 5) * rnorm(n, 1, sd = 1.5)
  )

head(df, n=10)
         X1       X2     X3 X4    X5         Y
1  1.000000 2.469388  6.625  7 10.00 -3.082943
2  1.090909 5.591837  3.625  9 10.00  5.328984
3  1.181818 9.448980  8.500  8  5.50 -1.662956
4  1.272727 6.693878  7.375  7  7.75  9.194237
5  1.363636 8.163265 10.000  6  1.00 10.029741
6  1.454545 1.918367  3.625  6  5.50  4.192727
7  1.545455 6.142857  4.750  1  7.75 23.012309
8  1.636364 9.081633  4.375  6  3.25 15.824952
9  1.727273 1.183673  8.125  2  3.25 -6.040846
10 1.818182 1.551020  1.375  8  1.00  4.536610
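As a quick sanity check on the intended granularity, one could count the distinct values per predictor; a small sketch (output not shown). The counts should come out near 100, 50, 25, 10 and 5, or slightly below for X2–X4 since those are sampled with replacement.

# number of distinct values per predictor, as a crude measure of granularity
sapply(df[, paste0("X", 1:5)], function(col) length(unique(col)))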

The order of highest to lowest granularity is: X1, X2, … X5. Let’s try our basic regression tree again.

result <- rpart(
  Y ~ .,
  data = df
)
varImp(result) |>
  arrange(desc(Overall)) |>
  print()
     Overall
X2 0.4795873
X1 0.4109932
X4 0.2470219
X3 0.2159749
X5 0.1918303
rpart.plot(result, main = "Regression Tree for Simulation")

The importance ranking does not follow the order of granularity exactly, though the two most granular variables, X2 and X1, do occupy the top two spots. Let's put it in a forest and see what happens.

randomForest(
  Y ~ .,
  data = df,
  importance = TRUE,
  ntree = 1000
) |>
  varImp() |>
  arrange(desc(Overall)) |>
  print()
     Overall
X2 12.263572
X1  5.847418
X5  4.324250
X3 -0.322365
X4 -5.680305

With random forest, the two most granular predictors, X2 and X1, are ranked most important, while the coarser X3 and X4 come out with negative scores. The ranking below the top two doesn't follow granularity exactly, so I'm inclined to say importance isn't driven by granularity alone. Before we move on, let's take a look at the correlations.

cor(df) |>
  corrplot(method = "number")

X4 shows the strongest correlation with Y in the plot, yet the forest did not rank it highly, which again suggests the importance scores are not simply tracking correlation with the response.

Problem 8.3

In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

Figure 8.24

8.3.a

Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?

The model on the right focuses on the first few predictors because we are encouraging it to with a high learning rate and a high bagging fraction. A bagging fraction of 0.9 means 90% of the training dataset is used for every tree (sampled without replacement), so each tree sees nearly the same data, and the ensemble behaves almost as if it were trained on a single sample; that invites overfitting to a handful of strong predictors. Boosting's credo is to repeatedly grow shallow trees on the residuals; the depth of each tree is capped directly by the tree-depth (interaction depth) parameter, and small bagging fractions add further diversity from tree to tree.

Additionally, the high learning rate works against careful, incremental fitting. Because the learning rate scales each tree's contribution to the ensemble, a rate of 0.9 lets the first few trees, and therefore the predictors they split on, dominate the final model. I'd be interested in knowing how much the importance profile would flatten if the learning rate were reduced while keeping the bagging fraction the same; my suspicion is that it wouldn't change much, given the extremely high bagging fraction.
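Here is a sketch of how one might reproduce the two extremes of Figure 8.24 on the solubility data with gbm directly; the exact n.trees and interaction depth used in the book may differ, so treat this as an assumption-laden illustration rather than a reproduction.

library(gbm)
library(AppliedPredictiveModeling)
data(solubility)  # provides solTrainXtrans and solTrainY

sol_train <- cbind(solTrainXtrans, Solubility = solTrainY)

# left-hand panel of Fig. 8.24: bagging fraction and learning rate both 0.1
gbm_left <- gbm(
  Solubility ~ ., data = sol_train, distribution = "gaussian",
  n.trees = 100, interaction.depth = 1,
  shrinkage = 0.1, bag.fraction = 0.1
)
# right-hand panel: both parameters pushed to 0.9
gbm_right <- gbm(
  Solubility ~ ., data = sol_train, distribution = "gaussian",
  n.trees = 100, interaction.depth = 1,
  shrinkage = 0.9, bag.fraction = 0.9
)
# relative influence; the right-hand fit should concentrate on a few predictors
head(summary(gbm_left, plotit = FALSE), 10)
head(summary(gbm_right, plotit = FALSE), 10)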

8.3.b

Which model do you think would be more predictive of other samples?

I think the model on the left would do a better job of predicting another sample from the same dataset. As was just argued in the previous question, the model on the right was fit very tightly to the sample on which it was trained. The model on the left adhered to “slow and easy wins the race”. It worked with smaller samples per tree (bagging fraction) and with a much smaller learning rate. It very likely took longer to train, but I believe that the payoff will be in its ability to generalize.

8.3.c

How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?

Boosting favors training shallow trees on small subsets of the data. That keeps each tree a weak learner, one that performs only slightly better than random guessing, and the diversity among weak learners is what helps prevent overfitting.

So, what would happen to these models if the interaction depth were increased? For the model on the right, I don't believe it would change very much: each tree is already being trained on 90% of the training set, so the variation from one sample to another is small. As I mentioned before, I think the model on the right approximates training a single tree on the entire training dataset.

How would the model on the left change? I think more is at risk there. With deeper trees, each one will fit more closely to the bootstrapped sample it is trained on, and because each sample is only 10% of the training data, the trees will start capturing the noise induced by the low bagging fraction. My guess is that this would have a flattening effect on the slope of predictor importance, spreading it across more predictors, but of this I'm not certain.
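And a sketch of the interaction-depth experiment itself, reusing the hypothetical sol_train object from the sketch above: refit the low bagging-fraction, low learning-rate model with much deeper trees and see whether the relative influence flattens out.

# same settings as gbm_left above, but with much deeper trees
gbm_left_deep <- gbm(
  Solubility ~ ., data = sol_train, distribution = "gaussian",
  n.trees = 100, interaction.depth = 10,
  shrinkage = 0.1, bag.fraction = 0.1
)
# compare the spread of relative influence against the shallow version
head(summary(gbm_left_deep, plotit = FALSE), 10)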

Problem 8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

From ChemicalManufacturingProcess’s vignette:

The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by–products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.

Prepare the data:

library(AppliedPredictiveModeling)

# load the data
data(ChemicalManufacturingProcess)
# impute it
pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)
# drop near-zero variance columns
nz <- nearZeroVar(df, saveMetrics = FALSE)
df <- df[, -nz]
# split it
x <- df[, -1]
y <- df$Yield
index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[index, ]
train_y <- y[index]
test_x <- x[-index, ]
test_y <- y[-index]
# Note: knnImpute already centers and scales the data as part of its distance
# calculation. Either way, one beautiful thing about decision trees is that they
# work fine with un-centered and unscaled data.
glimpse(train_x)
Rows: 124
Columns: 56
$ BiologicalMaterial01   <dbl> -0.2261036, 2.2391498, 2.2391498, -0.4081962, 0…
$ BiologicalMaterial02   <dbl> -1.5140979, 1.3089960, 1.3089960, 0.6620886, 1.…
$ BiologicalMaterial03   <dbl> -2.68303622, -0.05623504, -0.05623504, -0.59859…
$ BiologicalMaterial04   <dbl> 0.22017653, 1.29643860, 1.29643860, 1.58945236,…
$ BiologicalMaterial05   <dbl> 0.49419417, 0.41285548, 0.41285548, 1.73054229,…
$ BiologicalMaterial06   <dbl> -1.3828880, 1.1290767, 1.1290767, 0.6192092, 1.…
$ BiologicalMaterial08   <dbl> -1.23313130, 2.28261904, 2.28261904, 1.18948658…
$ BiologicalMaterial09   <dbl> -3.39628945, -0.72272251, -0.72272251, -1.73434…
$ BiologicalMaterial10   <dbl> 1.10052962, 1.10052962, 1.10052962, 1.63462551,…
$ BiologicalMaterial11   <dbl> -1.8386550, 1.3933948, 1.3933948, 1.0220617, 1.…
$ BiologicalMaterial12   <dbl> -1.77092243, 1.09898554, 1.09898554, 0.72408765…
$ ManufacturingProcess01 <dbl> 0.21541048, -6.14970280, -6.14970280, 0.4348971…
$ ManufacturingProcess02 <dbl> 0.5662872, -1.9692525, -1.9692525, -1.9692525, …
$ ManufacturingProcess03 <dbl> 0.37658102, 0.10870380, 0.46587343, 0.55516584,…
$ ManufacturingProcess04 <dbl> 0.56555981, -3.16385635, -3.32323311, -1.251335…
$ ManufacturingProcess05 <dbl> -0.44593467, 0.06246417, 0.42279841, 0.49486525…
$ ManufacturingProcess06 <dbl> -0.54149966, -0.11177452, 2.18503223, 0.5550403…
$ ManufacturingProcess07 <dbl> -0.1596700, 1.0378549, -0.9580199, 1.0378549, 1…
$ ManufacturingProcess08 <dbl> -0.3095182, 0.8941637, -1.1119728, 0.8941637, 0…
$ ManufacturingProcess09 <dbl> -1.72015237, -0.38159474, -0.47859167, -0.21993…
$ ManufacturingProcess10 <dbl> -0.07700901, 0.31428424, -0.02483658, 0.2881980…
$ ManufacturingProcess11 <dbl> -0.09157342, 0.55112383, 0.80261406, 1.41736795…
$ ManufacturingProcess12 <dbl> -0.4806937, -0.4806937, -0.4806937, -0.4806937,…
$ ManufacturingProcess13 <dbl> 0.977115122, 0.287650156, 0.287650156, -0.50030…
$ ManufacturingProcess14 <dbl> 0.809399878, 0.442586519, 0.791059210, 2.405037…
$ ManufacturingProcess15 <dbl> 1.1846438, 0.8245152, 1.0817499, 3.1396277, 1.1…
$ ManufacturingProcess16 <dbl> 0.330394508, 0.145576496, 0.196756869, 0.626103…
$ ManufacturingProcess17 <dbl> 0.92632958, 0.36552464, 0.36552464, -0.75608522…
$ ManufacturingProcess18 <dbl> 0.150534784, 0.183189824, 0.169583557, 0.142371…
$ ManufacturingProcess19 <dbl> 0.456379761, 1.092643705, 0.982943025, 1.904428…
$ ManufacturingProcess20 <dbl> 0.31099421, 0.18492296, 0.15627040, 0.39981714,…
$ ManufacturingProcess21 <dbl> 0.21098038, 0.21098038, 0.21098038, -0.55993755…
$ ManufacturingProcess22 <dbl> 0.05833309, -0.42205706, -0.12181322, 1.0791621…
$ ManufacturingProcess23 <dbl> 0.83176879, -1.21328258, -0.61179688, -1.213282…
$ ManufacturingProcess24 <dbl> 0.89072909, -0.83358055, -0.66114958, -1.350873…
$ ManufacturingProcess25 <dbl> 0.120018290, 0.184278572, 0.170891014, 0.114663…
$ ManufacturingProcess26 <dbl> 0.125634745, 0.215983072, 0.205227319, 0.241796…
$ ManufacturingProcess27 <dbl> 0.346035210, 0.210436193, 0.190661337, 0.351685…
$ ManufacturingProcess28 <dbl> 0.7826636, 0.8588688, 0.8588688, 0.9160227, 0.8…
$ ManufacturingProcess29 <dbl> 0.5943242, 0.7746248, 0.7746248, 1.0150257, 0.7…
$ ManufacturingProcess30 <dbl> 0.75669482, 0.24444298, 0.24444298, 0.96159555,…
$ ManufacturingProcess31 <dbl> -0.19525522, -0.15925668, -0.15925668, -0.35724…
$ ManufacturingProcess32 <dbl> -0.45688288, 2.69287190, 2.32231251, 2.69287190…
$ ManufacturingProcess33 <dbl> 0.9890307, 0.9890307, 1.7943843, 2.5997378, 0.5…
$ ManufacturingProcess34 <dbl> -1.7202722, 1.9568096, 0.1182687, 0.1182687, 0.…
$ ManufacturingProcess35 <dbl> -0.88694718, 1.23880740, 0.03729394, -0.5172507…
$ ManufacturingProcess36 <dbl> -0.6557774, -1.8000420, -1.8000420, -1.8000420,…
$ ManufacturingProcess37 <dbl> -1.15402429, -0.70466970, 0.41871678, -1.378701…
$ ManufacturingProcess38 <dbl> 0.7174727, -0.8224687, -0.8224687, -0.8224687, …
$ ManufacturingProcess39 <dbl> 0.23172698, 0.23172698, 0.23172698, 0.23172698,…
$ ManufacturingProcess40 <dbl> 0.05969714, -0.46265281, -0.46265281, -0.462652…
$ ManufacturingProcess41 <dbl> -0.06900773, -0.44058781, -0.44058781, -0.44058…
$ ManufacturingProcess42 <dbl> 0.202795698, 0.408810375, -0.312240995, 0.15129…
$ ManufacturingProcess43 <dbl> 2.40564734, 0.10146268, 0.21667191, 1.48397347,…
$ ManufacturingProcess44 <dbl> -0.01588055, -0.01588055, -0.01588055, -0.01588…
$ ManufacturingProcess45 <dbl> 0.64371849, 0.39796046, -0.09355562, -0.3393136…
tc <- trainControl(
  method = "cv",
  number = 10
)

testModel <- function(model) {
  pred <- predict(model, newdata = test_x)
  RMSE(pred, test_y)
}
result_df <- data.frame(
  Model = c("CART", "M5", "Bagged Tree", "Random Forest", "Boosted Tree", "Cubist"),
  Train_RMSE = c(NA, NA, NA, NA, NA, NA),
  Test_RMSE = c(NA, NA, NA, NA, NA, NA)
)

CART

model_cart <- train(
  x = train_x,
  y = train_y,
  method = "rpart",
  trControl = tc
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
result_df[1, 2] <- min(model_cart$results$RMSE)
result_df[1, 3] <- testModel(model_cart)
model_cart
CART 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 112, 111, 111, 112, 112, ... 
Resampling results across tuning parameters:

  cp         RMSE       Rsquared   MAE      
  0.0688919  0.7035582  0.4927953  0.5725373
  0.0920897  0.7076556  0.4845886  0.5745254
  0.4637273  0.9464529  0.2154154  0.7733660

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.0688919.

M5

model_m5 <- train(
  x = train_x,
  y = train_y,
  method = "M5",
  trControl = tc
)
result_df[2, 2] <- min(model_m5$results$RMSE)
result_df[2, 3] <- testModel(model_m5)
model_m5
Model Tree 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 111, 112, 112, 112, 112, ... 
Resampling results across tuning parameters:

  pruned  smoothed  rules  RMSE       Rsquared   MAE      
  Yes     Yes       Yes    0.6565724  0.6084554  0.5239816
  Yes     Yes       No     0.6893963  0.6138990  0.5404636
  Yes     No        Yes    0.6966411  0.5742321  0.5599403
  Yes     No        No     0.6919621  0.6000161  0.5690915
  No      Yes       Yes    0.7620329  0.4107222  0.6000468
  No      Yes       No     0.6703017  0.6315406  0.5121346
  No      No        Yes    0.8604658  0.3313407  0.6687136
  No      No        No     0.7434227  0.5176227  0.5659559

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were pruned = Yes, smoothed = Yes and
 rules = Yes.

Bagged Tree

model_bt <- train(
  x = train_x,
  y = train_y,
  method = "treebag",
  trControl = tc
)
result_df[3, 2] <- min(model_bt$results$RMSE)
result_df[3, 3] <- testModel(model_bt)
model_bt
Bagged CART 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 112, 112, 111, 112, 112, ... 
Resampling results:

  RMSE       Rsquared   MAE      
  0.6332627  0.6102629  0.4873854

Random Forest

model_rf <- train(
  x = train_x,
  y = train_y,
  method = "cforest",
  trControl = tc
)
result_df[4, 2] <- min(model_rf$results$RMSE)
result_df[4, 3] <- testModel(model_rf)
model_rf
Conditional Inference Random Forest 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 112, 112, 112, 112, 111, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared   MAE      
   2    0.7712408  0.5351633  0.6294213
  29    0.6184405  0.6230003  0.4940317
  56    0.6259547  0.5949818  0.4971699

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.

Boosted Tree

The following algorithm is stochastic gradient boosting.

model_boost <- train(
  x = train_x,
  y = train_y,
  method = "gbm",
  trControl = tc,
  verbose = FALSE
)
result_df[5, 2] <- min(model_boost$results$RMSE)
result_df[5, 3] <- testModel(model_boost)
model_boost
Stochastic Gradient Boosting 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 112, 111, 112, 112, 112, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  RMSE       Rsquared   MAE      
  1                   50      0.5989487  0.6534244  0.4637958
  1                  100      0.5654100  0.6791074  0.4337559
  1                  150      0.5554661  0.6872176  0.4234216
  2                   50      0.5857188  0.6712308  0.4435407
  2                  100      0.5728711  0.6762988  0.4324605
  2                  150      0.5671489  0.6882745  0.4210209
  3                   50      0.6062131  0.6506900  0.4726932
  3                  100      0.5763791  0.6842797  0.4463029
  3                  150      0.5616737  0.6979875  0.4326664

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
 1, shrinkage = 0.1 and n.minobsinnode = 10.

Cubist

model_cubist <- train(
  x = train_x,
  y = train_y,
  method = "cubist",
  trControl = tc
)
result_df[6, 2] <- min(model_cubist$results$RMSE)
result_df[6, 3] <- testModel(model_cubist)
model_cubist
Cubist 

124 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 112, 111, 111, 112, 112, 112, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE       Rsquared   MAE      
   1          0          0.6008791  0.6354875  0.4837303
   1          5          0.5536052  0.6986003  0.4436491
   1          9          0.5769907  0.6680799  0.4592846
  10          0          0.5811680  0.6357486  0.4556277
  10          5          0.5308447  0.6926929  0.4202194
  10          9          0.5458991  0.6706151  0.4320630
  20          0          0.5939586  0.6211443  0.4675100
  20          5          0.5415648  0.6787550  0.4261568
  20          9          0.5577433  0.6570586  0.4410662

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 10 and neighbors = 5.

8.7.a

Which tree-based regression model gives the optimal resampling and test set performance?

We shall use RMSE as the indicator of performance and present the models in order of best to worst.

Let’s see which gave the best resampling performance.

result_df |>
  arrange(Train_RMSE)
          Model Train_RMSE Test_RMSE
1        Cubist  0.5308447 0.6094330
2  Boosted Tree  0.5554661 0.7614677
3 Random Forest  0.6184405 0.7678373
4   Bagged Tree  0.6332627 0.7798096
5            M5  0.6565724 1.8965912
6          CART  0.7035582 0.8951656

Let’s see which gave the best testing performance.

result_df |>
  arrange(Test_RMSE)
          Model Train_RMSE Test_RMSE
1        Cubist  0.5308447 0.6094330
2  Boosted Tree  0.5554661 0.7614677
3 Random Forest  0.6184405 0.7678373
4   Bagged Tree  0.6332627 0.7798096
5          CART  0.7035582 0.8951656
6            M5  0.6565724 1.8965912

8.7.b

Which predictors are most important in the optimal tree-based regression model?

Cubist was our best performing model on both our training data and test data. So we will examine its predictors.

imp <- varImp(model_cubist)$importance |>
  as.data.frame() |>
  filter(Overall > 0) |>
  arrange(desc(Overall))
imp
                          Overall
ManufacturingProcess32 100.000000
ManufacturingProcess33  55.319149
ManufacturingProcess29  51.063830
ManufacturingProcess09  45.744681
ManufacturingProcess04  40.425532
ManufacturingProcess17  39.361702
ManufacturingProcess26  39.361702
BiologicalMaterial06    31.914894
ManufacturingProcess30  29.787234
ManufacturingProcess20  23.404255
BiologicalMaterial03    20.212766
ManufacturingProcess11  15.957447
ManufacturingProcess13  15.957447
ManufacturingProcess27  14.893617
ManufacturingProcess24  10.638298
BiologicalMaterial01     9.574468
ManufacturingProcess19   9.574468
ManufacturingProcess25   9.574468
BiologicalMaterial12     6.382979
BiologicalMaterial02     6.382979
ManufacturingProcess15   4.255319

Do either the biological or process variables dominate the list?

Let's count how many of the top 10 most important predictors come from each group, and compare that with the full list of non-zero importances.

top10 <- head(imp, 10)
paste("Manufacturing top-10 count:", sum(startsWith(rownames(top10), "Manufacturing")))
[1] "Manufacturing top-10 count: 9"
paste("Biological top-10 count:", sum(startsWith(rownames(top10), "Biological")))
[1] "Biological top-10 count: 1"
paste("Manufacturing/Biological total ratio:",
  sum(startsWith(rownames(imp), "Manufacturing")),
  "/",
  sum(startsWith(rownames(imp), "Biological"))
)
[1] "Manufacturing/Biological total ratio: 16 / 5"

Welly, welly, welly, among the top 10 the manufacturing predictors dominate 9 to 1. Across the full list of non-zero importances the split is 16 manufacturing to 5 biological. Given that the data contain roughly four times as many process predictors as biological ones, the biological variables are, proportionately, holding their own across the whole list, but the top of the list is clearly dominated by the manufacturing process variables.

How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

Linear                              Non-Linear                          Cubist
ManufacturingProcess32 = 100.00     ManufacturingProcess32 = 100.00     ManufacturingProcess32 = 100.000000
ManufacturingProcess04 =  84.21     BiologicalMaterial06   =  80.33     ManufacturingProcess09 =  71.590909
ManufacturingProcess28 =  80.46     BiologicalMaterial02   =  72.54     BiologicalMaterial06   =  50.000000
ManufacturingProcess43 =  64.85     ManufacturingProcess13 =  70.64     ManufacturingProcess17 =  40.909091
ManufacturingProcess33 =  62.49     ManufacturingProcess36 =  60.52     BiologicalMaterial10   =  40.909091
ManufacturingProcess12 =  57.85     BiologicalMaterial03   =  60.25     ManufacturingProcess33 =  36.363636
ManufacturingProcess35 =  38.26     ManufacturingProcess31 =  58.55     ManufacturingProcess13 =  36.363636
ManufacturingProcess37 =  38.03     ManufacturingProcess09 =  54.30     BiologicalMaterial04   =  32.954545
BiologicalMaterial11   =  37.20     BiologicalMaterial04   =  53.08     ManufacturingProcess31 =  29.545455
ManufacturingProcess10 =  36.51     BiologicalMaterial12   =  51.95     ManufacturingProcess04 =  29.545455

All three models identify ManufacturingProcess32 as the most important predictor. From there the rankings vary, with surprisingly little overlap. Somewhat interestingly, the drop from the top predictor to the second is considerably steeper for Cubist than for the linear and nonlinear models.

8.7.c

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

Oh dear, I just realized that 8.7 asks us to use "tree-based" models; up to this question I hadn't registered that. I looked for a non-Java alternative to M5 but didn't find one. My first thought was to plot our runner-up, the boosted model, but boosting is an ensemble, not a single tree. The only two single-tree models I have are CART and M5, and although M5 beat CART on the training set, I cannot figure out how to plot the distribution of yield in the terminal nodes of an M5 model. So, even though it doesn't entirely fulfill the request to plot the optimal single tree, I'm going to work with CART, since rpart.plot will show the yield in the terminal nodes.

rpart.plot(model_cart$finalModel, main = "CART Tree")

That’s very surprising. The first split is on ManufacturingProcess32 and it is the only split? This doesn’t seem like it can be right, but I don’t see anything wrong. If we look at the variable importance for CART:

varImp(model_cart)$importance |>
  as.data.frame() |>
  filter(Overall > 0) |>
  arrange(desc(Overall)) |>
  print()
                         Overall
BiologicalMaterial03   100.00000
BiologicalMaterial06    95.81838
ManufacturingProcess32  87.00209
BiologicalMaterial12    55.15929
ManufacturingProcess36  52.02182
BiologicalMaterial02    51.14444
BiologicalMaterial08    46.10175
BiologicalMaterial11    44.85281

We see that ManufacturingProcess32 is among the most important predictors, which is consistent with it being the split variable; rpart's importance measure also credits surrogate and competing splits, which is why variables like BiologicalMaterial03 and BiologicalMaterial06 can rank above it without appearing in the tree. Still, several other variables carry non-trivial importance, so the single split is surprising; it is most likely a consequence of the relatively large cp value (~0.069) that caret selected, which prunes the tree back hard. Taking the tree as it stands, the distribution of yield in the terminal nodes is indicated in the plot as roughly a 60/40 split, and no, it doesn't provide much additional knowledge about the biological or process predictors and their relationship with yield beyond confirming ManufacturingProcess32's dominance.
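One way to check whether the single split is simply an artifact of the large cp caret chose would be to refit rpart with a smaller complexity parameter and look at the deeper tree; a sketch (not run here):

# refit with a much smaller cp than the ~0.069 caret selected, to expose further splits
deeper <- rpart(
  Yield ~ .,
  data = cbind(train_x, Yield = train_y),
  control = rpart.control(cp = 0.01)
)
rpart.plot(deeper, main = "CART Tree, cp = 0.01")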