DATA 624: Assignment 09
Setup
library(caret)
library(corrplot)
library(mlbench)
library(party)
library(randomForest)
library(rJava)
library(RWeka)
library(rpart)
library(rpart.plot)
library(tidyverse)
set.seed(314159)
Assignment
Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
M5 vs. Cubist
I had trouble with the RWeka package. It seems that on Macs the full JDK needs to be installed. I like Java from a distance, but locally it's orthogonal to the way applications work in an operating system. So I opted for another regression model, Cubist. I think it is a good substitute for M5: it is a rule-based regression model that uses linear models within its rules to make predictions.
Update: I have since installed the full JDK and have been able to run M5, so I will include it in the results. It did not perform as well as I thought it would; it only outperformed CART on the training set. Cubist is still our champion, and he's not even a tree-based model. We shall tuck him in and wake him for next week's assignment.
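For reference, here is a minimal sketch of how the two contenders can be fit directly once the JDK issue is sorted out. It assumes the RWeka and Cubist packages are available and builds its own small Friedman benchmark sample; it is illustrative only, not part of the graded results.
library(RWeka)   # M5P is RWeka's M5 model tree (needs a working Java/JDK install)
library(Cubist)  # cubist() is the rule-based alternative used throughout this assignment
# Illustrative only: fit both learners on a small Friedman benchmark sample.
dat <- mlbench.friedman1(200, sd = 1)
dat <- as.data.frame(cbind(dat$x, dat$y))
colnames(dat)[ncol(dat)] <- "y"
m5_fit <- M5P(y ~ ., data = dat)
cubist_fit <- cubist(x = dat[, -ncol(dat)], y = dat$y)
summary(m5_fit)
summary(cubist_fit)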
Problem 8.1
Recreate the simulated data from Exercise 7.2:
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
# My goodness, it took a while to understand how this works.
colnames(simulated)[ncol(simulated)] <- "y"
8.1.a
Fit a random forest model to all of the predictors, then estimate the variable importance scores:
model <- randomForest(
  y ~ .,
  data = simulated,
  importance = TRUE,
  ntree = 1000
)
imp1 <- varImp(model, scale = FALSE) |>
  arrange(desc(Overall))
imp1
Overall
V4 12.06367073
V2 8.60313246
V1 8.49257256
V5 2.91084899
V3 1.06489047
V7 0.04315761
V8 0.02033607
V9 0.01368384
V10 0.00239764
V6 -0.20056409
Did the random forest model significantly use the uninformative predictors (V6– V10)?
I asked the Oracle what the units of varImp are, and the answer depends on the model. For a random forest regression, caret reports the permutation importance: how much the error grows when a variable's values are permuted (the regression analogue of MDA, Mean Decrease in Accuracy). Summing the three most important variables (V4, V2 and V1), we can calculate exactly how far they outdistance the rest of the variables' importance.
s1 <- sum(imp1$Overall[1:3])
s2 <- sum(imp1$Overall[4:nrow(imp1)])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables = 0.883239359336319"
As can be seen, the three most important variables, none of which are uninformative predictors, collectively account for ~88% of the overall importance.
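As a cross-check (my own addition, not required by the exercise), the same unscaled permutation importance can be pulled straight from the randomForest object; it should line up with the varImp values above.
# type = 1 requests the permutation-based importance (%IncMSE for regression);
# scale = FALSE leaves it in raw units, matching varImp(model, scale = FALSE).
importance(model, type = 1, scale = FALSE)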
8.1.b
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simcpy <- simulated
simcpy$duplicate1 <- simcpy$V4 + rnorm(200) * .1
cor(simcpy$duplicate1, simcpy$V4)
[1] 0.9498567
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model <- randomForest(
  y ~ .,
  data = simcpy,
  importance = TRUE,
  ntree = 1000
)
imp2 <- varImp(model, scale = FALSE) |>
  arrange(desc(Overall))
imp2
Overall
V2 8.20552963
V4 7.99952034
V1 6.95579791
duplicate1 6.09891479
V5 2.50339382
V3 1.12478131
V9 0.07797291
V8 0.01182104
V10 -0.07809073
V7 -0.08405116
V6 -0.11529284
Ah, very interesting. The importance of V4 (the variable we duplicated) is roughly two-thirds of what it was, dropping from about 12.1 to 8.0, while its near-copy duplicate1 picks up an overall score of about 6.1. Summing V4 and duplicate1 puts us back in the neighborhood of (even a bit above) V4's original overall score (reduction in accuracy); the importance has essentially been split between the two near-copies. Nonetheless, the uninformative predictors are still not among the significant variables.
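The exercise also asks what happens when yet another highly correlated predictor is added. Here is a quick sketch of that follow-up (duplicate2 and model_dup2 are hypothetical names of my own), following the same recipe as duplicate1.
# Add a second near-copy of V4 and refit; the importance credited to V4 is
# expected to be diluted further, now shared across three similar columns.
simcpy$duplicate2 <- simcpy$V4 + rnorm(200) * .1
model_dup2 <- randomForest(
  y ~ .,
  data = simcpy,
  importance = TRUE,
  ntree = 1000
)
varImp(model_dup2, scale = FALSE) |>
  arrange(desc(Overall))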
8.1.c
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
imp3 <- cforest(
  formula = y ~ .,
  data = simulated
) |>
  varimp() |>
  as.data.frame()
colnames(imp3) <- c("score")
imp3 <- arrange(imp3, desc(score))
imp3
As can be seen, the 3 most important variables are V4, V2 and V1, the same as with randomForest. And, looking at varimp's documentation, we see that it also reports a permutation-based decrease in accuracy. The distribution of the scores is a little different; nonetheless, I think the sum of the first 3 and their ratio to the rest of the variables will be nearly the same. Let's see:
s1 <- sum(imp3[1:3, 1])
s2 <- sum(imp3[4:nrow(imp3), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables = 0.92566045035994"
Using party's random forest we get a slight increase in the share of importance held by the top three variables (roughly 4 percentage points, ~93% versus ~88%).
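The exercise also points at the conditional argument from Strobl et al. (2007). A sketch of that variant is below; note that the conditional version is considerably slower to compute.
# Conditional permutation importance adjusts for correlations among predictors.
cforest(
  formula = y ~ .,
  data = simulated
) |>
  varimp(conditional = TRUE) |>
  sort(decreasing = TRUE)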
8.1.d
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boost
imp4 <- varImp(train(
  y ~ .,
  data = simulated,
  method = "gbm"
))$importance |>
  arrange(desc(Overall))
print(imp4)
Overall
V4 100.000000
V2 58.727094
V1 54.572149
V5 21.335167
V3 8.129586
V7 2.392881
V9 2.303282
V8 1.484307
V6 0.514422
V10 0.000000
s1 <- sum(imp4[1:3, 1])
s2 <- sum(imp4[4:nrow(imp4), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables = 0.855047675757029"
Training the boosted model is slower, but the order of the first three variables is the same: V4, V2 and V1. Here varImp is reported on a scaled 0-100 basis rather than in MDA units; nonetheless, if we look at the ratio of the sum of the first 3 scores to the sum of all, we find that it is ~86%, which is slightly less than randomForest's percentage.
Cubist
imp5 <- varImp(train(
  y ~ .,
  data = simulated,
  method = "cubist"
))$importance |>
  arrange(desc(Overall))
print(imp5)
Overall
V4 100.0000000
V2 57.7656579
V1 52.5984595
V5 20.7686059
V3 7.5742126
V7 2.1878799
V9 2.0147359
V8 1.3946398
V10 0.2677225
V6 0.0000000
s1 <- sum(imp5[1:3, 1])
s2 <- sum(imp5[4:nrow(imp5), 1])
paste("Ratio of 3 most important variables = ", s1 / (s1 + s2))
[1] "Ratio of 3 most important variables = 0.860131950603612"
The order of the first three variables of importance is the same: V4, V2 and V1. The ratio of importance is almost identical to the boosted model's. Let's compare all 4 models' results side by side:
data.frame(
RFVariable = rownames(imp1),
RFScore = imp1$Overall,
cRFVariable = rownames(imp3),
cRFScore = imp3$score,
BoostVariable = rownames(imp4),
BoostScore = imp4$Overall,
CubistVariable = rownames(imp5),
CubistScore = imp5$Overall
)
RFVariable RFScore cRFVariable cRFScore BoostVariable BoostScore
1 V4 12.06367073 V4 13.72910219 V4 100.000000
2 V2 8.60313246 V2 8.84083652 V2 58.727094
3 V1 8.49257256 V1 7.85429210 V1 54.572149
4 V5 2.91084899 V5 2.29682903 V5 21.335167
5 V3 1.06489047 V3 0.15394511 V3 8.129586
6 V7 0.04315761 V9 0.09548485 V7 2.392881
7 V8 0.02033607 V8 0.02418386 V9 2.303282
8 V9 0.01368384 V7 -0.03629831 V8 1.484307
9 V10 0.00239764 V10 -0.03825558 V6 0.514422
10 V6 -0.20056409 V6 -0.05252690 V10 0.000000
CubistVariable CubistScore
1 V4 100.0000000
2 V2 57.7656579
3 V1 52.5984595
4 V5 20.7686059
5 V3 7.5742126
6 V7 2.1878799
7 V9 2.0147359
8 V8 1.3946398
9 V10 0.2677225
10 V6 0.0000000
The order of the first 4 variables is identical across all 4 models, and those 4 variables account for more than 90% of the importance.
Problem 8.2
Use a simulation to show tree bias with different granularities.
A simulation? A simulation of our own making. And different granularities: that could mean different granularities within one dataset or across different versions of the same dataset. Let's go with the former and see how it goes. First we shall create our simulated dataset: a data frame with five predictors, each with a different distribution and granularity, which I take to mean the number of distinct values that show up in its histogram.
n <- 100
df <- data.frame(
  N1 = rnorm(n, mean = 0, sd = 1),
  N2 = rnorm(n, mean = 5, sd = 5),
  B1 = rbinom(n, size = 1, prob = 0.5),
  S1 = sample(1:10, n, replace = TRUE),
  S2 = sample(c("cat", "dog", "fish"), n, replace = TRUE)
) |>
  mutate(
    Y = ((N1 + N2 + B1 + S1) / 4) * rnorm(n, 1, sd = 0.2)
  )
head(df, n = 10)
N1 N2 B1 S1 S2 Y
1 -1.1706107 4.76404334 1 5 dog 2.0665263
2 -0.6374881 7.76551363 0 2 dog 2.0889133
3 1.1974370 -4.40449669 1 6 cat 1.0331775
4 -1.4399896 1.82297967 0 2 fish 0.5440164
5 -0.4549463 2.30127971 1 4 fish 1.7565751
6 -0.2126887 2.39962847 1 4 fish 1.4491595
7 0.1732326 -5.03996154 1 8 fish 1.3540085
8 -0.7320212 2.97503481 1 5 fish 1.5839733
9 0.8889294 -0.01300527 0 4 fish 1.2143430
10 1.2237585 -1.07088589 1 3 fish 1.0745994
Now we shall build our tree. Let's start with a basic regression tree and see how that goes.
result <- rpart(
  Y ~ .,
  data = df
)
varImp(result) |>
  arrange(desc(Overall)) |>
  print()
Overall
N2 2.8371014
S1 1.2905754
S2 0.6750443
N1 0.5590290
B1 0.5368589
rpart.plot(result, main = "Regression Tree for Simulation")
Well, we see that our most granular variable, N2, is also our most important variable. But our second most granular variable, N1, is not our second most important variable; S1 is. Let's just see what our correlations look like.
df |>
  select(-S2) |>
  cor() |>
  corrplot(method = "number")
Well, N2 and Y show the highest correlation. Perhaps we should go about this in a different way. Let's generate a dataset in which every predictor has the same range but a different amount of granularity, and then see which variable rises to the surface.
n <- 100
df <- data.frame(
  X1 = seq(1, 10, length.out = n),
  X2 = sample(seq(1, 10, length.out = 50), n, replace = TRUE),
  X3 = sample(seq(1, 10, length.out = 25), n, replace = TRUE),
  X4 = sample(seq(1, 10, length.out = 10), n, replace = TRUE),
  X5 = sample(seq(1, 10, length.out = 5), n, replace = TRUE)
) |>
  mutate(
    Y = ((X1 + X2 + X3 + X4 + X5) / 5) * rnorm(n, 1, sd = 1.5)
  )
head(df, n = 10)
X1 X2 X3 X4 X5 Y
1 1.000000 2.469388 6.625 7 10.00 -3.082943
2 1.090909 5.591837 3.625 9 10.00 5.328984
3 1.181818 9.448980 8.500 8 5.50 -1.662956
4 1.272727 6.693878 7.375 7 7.75 9.194237
5 1.363636 8.163265 10.000 6 1.00 10.029741
6 1.454545 1.918367 3.625 6 5.50 4.192727
7 1.545455 6.142857 4.750 1 7.75 23.012309
8 1.636364 9.081633 4.375 6 3.25 15.824952
9 1.727273 1.183673 8.125 2 3.25 -6.040846
10 1.818182 1.551020 1.375 8 1.00 4.536610
The order from highest to lowest granularity is: X1, X2, … X5. Let's try our basic regression tree again.
result <- rpart(
  Y ~ .,
  data = df
)
varImp(result) |>
  arrange(desc(Overall)) |>
  print()
Overall
X2 0.4795873
X1 0.4109932
X4 0.2470219
X3 0.2159749
X5 0.1918303
rpart.plot(result, main = "Regression Tree for Simulation")
The order is not in order of granularity, though X1 is the first variable. Let's put it in a forest and see what happens.
randomForest(
  Y ~ .,
  data = df,
  importance = TRUE,
  ntree = 1000
) |>
  varImp() |>
  arrange(desc(Overall)) |>
  print()
Overall
X2 12.263572
X1 5.847418
X5 4.324250
X3 -0.322365
X4 -5.680305
And with random forest, the most important predictor is X2 (the second most granular), followed by X1 (the most granular), while the least granular, X5, comes out ahead of X3 and X4, both of which land with negative importance. I'm inclined to say that importance isn't simply tied to granularity. Before we move on, let's take a look at correlation.
cor(df) |>
  corrplot(method = "number")
In the correlation plot, the predictor most correlated with Y also sits near the top of the forest's importance ranking; it seems like random forest is pretty good at finding the most correlated variable, whatever its granularity.
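As a supplementary sketch (my own construction), the cleanest way I know to isolate the bias is to make every predictor pure noise with respect to Y, so that any preference the tree shows can only come from granularity, and then count which predictor wins the root split over many simulated datasets.
# All predictors are independent of Y; only their granularity differs.
# Over repeated simulations the root split should favor X1 (continuous),
# then X2 (10 distinct values), then X3 (binary).
set.seed(42)
first_split <- replicate(200, {
  m <- 200
  noise <- data.frame(
    X1 = runif(m),                                               # ~continuous
    X2 = sample(seq(1, 10, length.out = 10), m, replace = TRUE), # 10 values
    X3 = sample(0:1, m, replace = TRUE),                         # binary
    Y  = rnorm(m)                                                # unrelated to X1-X3
  )
  fit <- rpart(Y ~ ., data = noise, cp = 0, maxdepth = 1)
  as.character(fit$frame$var[1])                                 # root split variable
})
table(first_split)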
Problem 8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
8.3.a
Why does the model on the right focus its importance on just the first few of predictors, whereas the model on the left spreads importance across more predictors?
The model on the right focuses on the first few predictors because we are encouraging it to with a high learning rate and a high bagging fraction. A bagging fraction of 0.9 means 90% of the training set is drawn (without replacement) for every tree, so each tree sees nearly the same data; the model is essentially being trained on a single sample, which invites overfitting and lets the same few strong predictors be chosen again and again. Boosting's credo is to repeatedly grow shallow trees on the residuals; the depth of each tree is capped directly by the interaction depth tuning parameter, and small bagging fractions are another way of keeping the individual learners weak and diverse.
Additionally, the high learning rate pushes in the same direction: the learning rate scales each tree's contribution to the final prediction, so a large value lets the first few trees dominate and effectively means fewer trees do the work. I'd be interested in knowing how much the importance would spread out if the learning rate were reduced while keeping the bagging fraction the same; my suspicion is that it wouldn't have much of an effect, due to the extremely high bagging fraction.
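To make this concrete, here is a sketch of the two regimes using the gbm package directly. We do not have the solubility data loaded in this document, so I reuse the simulated data from 8.1 as a stand-in; the number of trees and the interaction depth are arbitrary choices of mine, not the book's settings.
library(gbm)
# Left-hand panel analogue: small bagging fraction and small learning rate.
fit_low <- gbm(y ~ ., data = simulated, distribution = "gaussian",
               n.trees = 1000, interaction.depth = 2,
               shrinkage = 0.1, bag.fraction = 0.1)
# Right-hand panel analogue: both parameters pushed to 0.9.
fit_high <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                n.trees = 1000, interaction.depth = 2,
                shrinkage = 0.9, bag.fraction = 0.9)
# Relative influence: expect it to be spread across more predictors for fit_low
# and concentrated on the top few for fit_high.
head(summary(fit_low, plotit = FALSE))
head(summary(fit_high, plotit = FALSE))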
8.3.b
Which model do you think would be more predictive of other samples?
I think the model on the left would do a better job of predicting another sample from the same dataset. As was just argued in the previous question, the model on the right was fit very tightly to the sample on which it was trained. The model on the left adhered to "slow and steady wins the race": it worked with smaller samples per tree (bagging fraction) and with a much smaller learning rate. It very likely took longer to train, but I believe the payoff will be in its ability to generalize.
8.3.c
How would increasing interaction depth affect the slope of predictor importance for either model in Fig.8.24?
Boosting prefers small subsets of the training data and shallow trees; this helps prevent overfitting by encouraging diversity among weak learners, each of which performs only slightly better than random guessing.
So, what would happen to these models if the interaction depth were increased? For the model on the right, I don't believe it would change very much. Each tree is already being trained on 90% of the training set, so the variation from one tree's sample to another is small; as I mentioned before, I think the model on the right approximates training a single tree on the entire training dataset.
How would the model on the left change? I think more is at risk there. With deeper trees, each one will fit more closely to the bootstrapped sample on which it is trained, and because each sample is only 10% of the training data, it will capture more of the noise induced by the low bagging fraction. My guess is that this would have a flattening effect on the slope of predictor importance, but of this I'm not certain.
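And a sketch of the interaction-depth question itself, following the same setup as the sketch in 8.3.a but with much deeper trees (again a stand-in for the solubility data, with settings of my own choosing):
# Same two parameter regimes as the 8.3.a sketch, but with deeper trees.
# Comparing these summaries against the depth-2 fits shows whether the
# importance profile flattens out as the interaction depth grows.
fit_low_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                    n.trees = 1000, interaction.depth = 10,
                    shrinkage = 0.1, bag.fraction = 0.1)
fit_high_deep <- gbm(y ~ ., data = simulated, distribution = "gaussian",
                     n.trees = 1000, interaction.depth = 10,
                     shrinkage = 0.9, bag.fraction = 0.9)
head(summary(fit_low_deep, plotit = FALSE))
head(summary(fit_high_deep, plotit = FALSE))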
Problem 8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
From ChemicalManufacturingProcess’s vignette:
The data set consisted of 177 samples of biological material for which 57 characteristics were measured. Of the 57 characteristics, there were 12 measurements of the biological starting material, and 45 measurements of the manufacturing process. The process variables included measurements such as temperature, drying time, washing time, and concentrations of by–products at various steps. Some of the process measurements can be controlled, while others are observed. Predictors are continuous, count, categorical; some are correlated, and some contain missing values. Samples are not independent because sets of samples come from the same batch of biological starting material.
Prepare the data:
library(AppliedPredictiveModeling)
# load the data
data(ChemicalManufacturingProcess)
# impute it
pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)
# drop near-zero variance columns
nz <- nearZeroVar(df, saveMetrics = FALSE)
df <- df[, -nz]
# split it
x <- df[, -1]
y <- df$Yield
index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[index, ]
train_y <- y[index]
test_x <- x[-index, ]
test_y <- y[-index]
# Note: I don't think we need to center and scale it. One beautiful thing about
# decision trees is that they work fine with un-centered and unscaled data.
glimpse(train_x)
Rows: 124
Columns: 56
$ BiologicalMaterial01 <dbl> -0.2261036, 2.2391498, 2.2391498, -0.4081962, 0…
$ BiologicalMaterial02 <dbl> -1.5140979, 1.3089960, 1.3089960, 0.6620886, 1.…
$ BiologicalMaterial03 <dbl> -2.68303622, -0.05623504, -0.05623504, -0.59859…
$ BiologicalMaterial04 <dbl> 0.22017653, 1.29643860, 1.29643860, 1.58945236,…
$ BiologicalMaterial05 <dbl> 0.49419417, 0.41285548, 0.41285548, 1.73054229,…
$ BiologicalMaterial06 <dbl> -1.3828880, 1.1290767, 1.1290767, 0.6192092, 1.…
$ BiologicalMaterial08 <dbl> -1.23313130, 2.28261904, 2.28261904, 1.18948658…
$ BiologicalMaterial09 <dbl> -3.39628945, -0.72272251, -0.72272251, -1.73434…
$ BiologicalMaterial10 <dbl> 1.10052962, 1.10052962, 1.10052962, 1.63462551,…
$ BiologicalMaterial11 <dbl> -1.8386550, 1.3933948, 1.3933948, 1.0220617, 1.…
$ BiologicalMaterial12 <dbl> -1.77092243, 1.09898554, 1.09898554, 0.72408765…
$ ManufacturingProcess01 <dbl> 0.21541048, -6.14970280, -6.14970280, 0.4348971…
$ ManufacturingProcess02 <dbl> 0.5662872, -1.9692525, -1.9692525, -1.9692525, …
$ ManufacturingProcess03 <dbl> 0.37658102, 0.10870380, 0.46587343, 0.55516584,…
$ ManufacturingProcess04 <dbl> 0.56555981, -3.16385635, -3.32323311, -1.251335…
$ ManufacturingProcess05 <dbl> -0.44593467, 0.06246417, 0.42279841, 0.49486525…
$ ManufacturingProcess06 <dbl> -0.54149966, -0.11177452, 2.18503223, 0.5550403…
$ ManufacturingProcess07 <dbl> -0.1596700, 1.0378549, -0.9580199, 1.0378549, 1…
$ ManufacturingProcess08 <dbl> -0.3095182, 0.8941637, -1.1119728, 0.8941637, 0…
$ ManufacturingProcess09 <dbl> -1.72015237, -0.38159474, -0.47859167, -0.21993…
$ ManufacturingProcess10 <dbl> -0.07700901, 0.31428424, -0.02483658, 0.2881980…
$ ManufacturingProcess11 <dbl> -0.09157342, 0.55112383, 0.80261406, 1.41736795…
$ ManufacturingProcess12 <dbl> -0.4806937, -0.4806937, -0.4806937, -0.4806937,…
$ ManufacturingProcess13 <dbl> 0.977115122, 0.287650156, 0.287650156, -0.50030…
$ ManufacturingProcess14 <dbl> 0.809399878, 0.442586519, 0.791059210, 2.405037…
$ ManufacturingProcess15 <dbl> 1.1846438, 0.8245152, 1.0817499, 3.1396277, 1.1…
$ ManufacturingProcess16 <dbl> 0.330394508, 0.145576496, 0.196756869, 0.626103…
$ ManufacturingProcess17 <dbl> 0.92632958, 0.36552464, 0.36552464, -0.75608522…
$ ManufacturingProcess18 <dbl> 0.150534784, 0.183189824, 0.169583557, 0.142371…
$ ManufacturingProcess19 <dbl> 0.456379761, 1.092643705, 0.982943025, 1.904428…
$ ManufacturingProcess20 <dbl> 0.31099421, 0.18492296, 0.15627040, 0.39981714,…
$ ManufacturingProcess21 <dbl> 0.21098038, 0.21098038, 0.21098038, -0.55993755…
$ ManufacturingProcess22 <dbl> 0.05833309, -0.42205706, -0.12181322, 1.0791621…
$ ManufacturingProcess23 <dbl> 0.83176879, -1.21328258, -0.61179688, -1.213282…
$ ManufacturingProcess24 <dbl> 0.89072909, -0.83358055, -0.66114958, -1.350873…
$ ManufacturingProcess25 <dbl> 0.120018290, 0.184278572, 0.170891014, 0.114663…
$ ManufacturingProcess26 <dbl> 0.125634745, 0.215983072, 0.205227319, 0.241796…
$ ManufacturingProcess27 <dbl> 0.346035210, 0.210436193, 0.190661337, 0.351685…
$ ManufacturingProcess28 <dbl> 0.7826636, 0.8588688, 0.8588688, 0.9160227, 0.8…
$ ManufacturingProcess29 <dbl> 0.5943242, 0.7746248, 0.7746248, 1.0150257, 0.7…
$ ManufacturingProcess30 <dbl> 0.75669482, 0.24444298, 0.24444298, 0.96159555,…
$ ManufacturingProcess31 <dbl> -0.19525522, -0.15925668, -0.15925668, -0.35724…
$ ManufacturingProcess32 <dbl> -0.45688288, 2.69287190, 2.32231251, 2.69287190…
$ ManufacturingProcess33 <dbl> 0.9890307, 0.9890307, 1.7943843, 2.5997378, 0.5…
$ ManufacturingProcess34 <dbl> -1.7202722, 1.9568096, 0.1182687, 0.1182687, 0.…
$ ManufacturingProcess35 <dbl> -0.88694718, 1.23880740, 0.03729394, -0.5172507…
$ ManufacturingProcess36 <dbl> -0.6557774, -1.8000420, -1.8000420, -1.8000420,…
$ ManufacturingProcess37 <dbl> -1.15402429, -0.70466970, 0.41871678, -1.378701…
$ ManufacturingProcess38 <dbl> 0.7174727, -0.8224687, -0.8224687, -0.8224687, …
$ ManufacturingProcess39 <dbl> 0.23172698, 0.23172698, 0.23172698, 0.23172698,…
$ ManufacturingProcess40 <dbl> 0.05969714, -0.46265281, -0.46265281, -0.462652…
$ ManufacturingProcess41 <dbl> -0.06900773, -0.44058781, -0.44058781, -0.44058…
$ ManufacturingProcess42 <dbl> 0.202795698, 0.408810375, -0.312240995, 0.15129…
$ ManufacturingProcess43 <dbl> 2.40564734, 0.10146268, 0.21667191, 1.48397347,…
$ ManufacturingProcess44 <dbl> -0.01588055, -0.01588055, -0.01588055, -0.01588…
$ ManufacturingProcess45 <dbl> 0.64371849, 0.39796046, -0.09355562, -0.3393136…
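A quick sanity check on the prepared data (my own addition): imputation should have removed all missing values, and the near-zero-variance filter should have dropped at least one column.
sum(is.na(df))                                # expect 0 after knnImpute
ncol(ChemicalManufacturingProcess) - ncol(df) # columns dropped by nearZeroVar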
tc <- trainControl(
  method = "cv",
  number = 10
)
testModel <- function(model) {
  pred <- predict(model, newdata = test_x)
  rmse <- RMSE(pred, test_y)
  return(rmse)
}
result_df <- data.frame(
  Model = c("CART", "M5", "Bagged Tree", "Random Forest", "Boosted Tree", "Cubist"),
  Train_RMSE = c(NA, NA, NA, NA, NA, NA),
  Test_RMSE = c(NA, NA, NA, NA, NA, NA)
)
CART
model_cart <- train(
  x = train_x,
  y = train_y,
  method = "rpart",
  trControl = tc
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
result_df[1, 2] <- min(model_cart$results$RMSE)
result_df[1, 3] <- testModel(model_cart)
model_cart
CART
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 112, 111, 111, 112, 112, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.0688919 0.7035582 0.4927953 0.5725373
0.0920897 0.7076556 0.4845886 0.5745254
0.4637273 0.9464529 0.2154154 0.7733660
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.0688919.
M5
model_m5 <- train(
  x = train_x,
  y = train_y,
  method = "M5",
  trControl = tc
)
result_df[2, 2] <- min(model_m5$results$RMSE)
result_df[2, 3] <- testModel(model_m5)
model_m5
Model Tree
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 111, 112, 112, 112, 112, ...
Resampling results across tuning parameters:
pruned smoothed rules RMSE Rsquared MAE
Yes Yes Yes 0.6565724 0.6084554 0.5239816
Yes Yes No 0.6893963 0.6138990 0.5404636
Yes No Yes 0.6966411 0.5742321 0.5599403
Yes No No 0.6919621 0.6000161 0.5690915
No Yes Yes 0.7620329 0.4107222 0.6000468
No Yes No 0.6703017 0.6315406 0.5121346
No No Yes 0.8604658 0.3313407 0.6687136
No No No 0.7434227 0.5176227 0.5659559
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were pruned = Yes, smoothed = Yes and
rules = Yes.
Bagged Tree
model_bt <- train(
  x = train_x,
  y = train_y,
  method = "treebag",
  mtry = ncol(train_x) %/% 3,
  trControl = tc
)
result_df[3, 2] <- min(model_bt$results$RMSE)
result_df[3, 3] <- testModel(model_bt)
model_bt
Bagged CART
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 112, 112, 111, 112, 112, ...
Resampling results:
RMSE Rsquared MAE
0.6332627 0.6102629 0.4873854
Random Forest
model_rf <- train(
  x = train_x,
  y = train_y,
  method = "cforest",
  trControl = tc
)
result_df[4, 2] <- min(model_rf$results$RMSE)
result_df[4, 3] <- testModel(model_rf)
model_rf
Conditional Inference Random Forest
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 112, 112, 112, 112, 111, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.7712408 0.5351633 0.6294213
29 0.6184405 0.6230003 0.4940317
56 0.6259547 0.5949818 0.4971699
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 29.
Boosted Tree
The following algorithm is stochastic gradient boosting.
model_boost <- train(
  x = train_x,
  y = train_y,
  method = "gbm",
  trControl = tc,
  verbose = FALSE
)
result_df[5, 2] <- min(model_boost$results$RMSE)
result_df[5, 3] <- testModel(model_boost)
model_boost
Stochastic Gradient Boosting
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 112, 111, 112, 112, 112, ...
Resampling results across tuning parameters:
interaction.depth n.trees RMSE Rsquared MAE
1 50 0.5989487 0.6534244 0.4637958
1 100 0.5654100 0.6791074 0.4337559
1 150 0.5554661 0.6872176 0.4234216
2 50 0.5857188 0.6712308 0.4435407
2 100 0.5728711 0.6762988 0.4324605
2 150 0.5671489 0.6882745 0.4210209
3 50 0.6062131 0.6506900 0.4726932
3 100 0.5763791 0.6842797 0.4463029
3 150 0.5616737 0.6979875 0.4326664
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 150, interaction.depth =
1, shrinkage = 0.1 and n.minobsinnode = 10.
Cubist
model_cubist <- train(
  x = train_x,
  y = train_y,
  method = "cubist",
  trControl = tc
)
result_df[6, 2] <- min(model_cubist$results$RMSE)
result_df[6, 3] <- testModel(model_cubist)
model_cubist
Cubist
124 samples
56 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 112, 111, 111, 112, 112, 112, ...
Resampling results across tuning parameters:
committees neighbors RMSE Rsquared MAE
1 0 0.6008791 0.6354875 0.4837303
1 5 0.5536052 0.6986003 0.4436491
1 9 0.5769907 0.6680799 0.4592846
10 0 0.5811680 0.6357486 0.4556277
10 5 0.5308447 0.6926929 0.4202194
10 9 0.5458991 0.6706151 0.4320630
20 0 0.5939586 0.6211443 0.4675100
20 5 0.5415648 0.6787550 0.4261568
20 9 0.5577433 0.6570586 0.4410662
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were committees = 10 and neighbors = 5.
8.7.a
Which tree-based regression model gives the optimal resampling and test set performance?
We shall use the RMSE as an indicator of performance and present each in order of best to worst.
Let’s see which gave the best resampling performance.
result_df |>
  arrange(Train_RMSE)
Model Train_RMSE Test_RMSE
1 Cubist 0.5308447 0.6094330
2 Boosted Tree 0.5554661 0.7614677
3 Random Forest 0.6184405 0.7678373
4 Bagged Tree 0.6332627 0.7798096
5 M5 0.6565724 1.8965912
6 CART 0.7035582 0.8951656
Let’s see which gave the best testing performance.
result_df |>
  arrange(Test_RMSE)
Model Train_RMSE Test_RMSE
1 Cubist 0.5308447 0.6094330
2 Boosted Tree 0.5554661 0.7614677
3 Random Forest 0.6184405 0.7678373
4 Bagged Tree 0.6332627 0.7798096
5 CART 0.7035582 0.8951656
6 M5 0.6565724 1.8965912
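As a complementary view (a sketch, not part of the table above), caret's resamples() can line up the cross-validated RMSE distributions of all six fits. Strictly speaking the comparison is cleanest when every model is trained on identical folds, which is not the case here since each train() call drew its own folds, so treat it as indicative only.
# Collect the 10-fold CV resamples from each train() object and summarize RMSE.
rs <- resamples(list(
  CART   = model_cart,
  M5     = model_m5,
  Bagged = model_bt,
  cRF    = model_rf,
  GBM    = model_boost,
  Cubist = model_cubist
))
summary(rs)$statistics$RMSE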
8.7.b
Which predictors are most important in the optimal tree-based regression model?
Cubist was our best performing model on both our training data and test data. So we will examine its predictors.
imp <- varImp(model_cubist)$importance |>
  as.data.frame() |>
  filter(Overall > 0) |>
  arrange(desc(Overall))
imp
Overall
ManufacturingProcess32 100.000000
ManufacturingProcess33 55.319149
ManufacturingProcess29 51.063830
ManufacturingProcess09 45.744681
ManufacturingProcess04 40.425532
ManufacturingProcess17 39.361702
ManufacturingProcess26 39.361702
BiologicalMaterial06 31.914894
ManufacturingProcess30 29.787234
ManufacturingProcess20 23.404255
BiologicalMaterial03 20.212766
ManufacturingProcess11 15.957447
ManufacturingProcess13 15.957447
ManufacturingProcess27 14.893617
ManufacturingProcess24 10.638298
BiologicalMaterial01 9.574468
ManufacturingProcess19 9.574468
ManufacturingProcess25 9.574468
BiologicalMaterial12 6.382979
BiologicalMaterial02 6.382979
ManufacturingProcess15 4.255319
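caret can also plot these directly; a one-line sketch:
# Dotplot of the top 20 Cubist importances.
plot(varImp(model_cubist), top = 20)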
Do either the biological or process variables dominate the list?
Let’s consider the top 10 most important predictors.
print("Manufacturing top-10 count: 6")
[1] "Manufacturing top-10 count: 6"
print("Biological top-10 count: 4")
[1] "Biological top-10 count: 4"
paste("Manufacturing/Biological total ratio:",
sum(startsWith(rownames(imp), "Manufacturing")),
"/",
sum(startsWith(rownames(imp), "Biological"))
)
[1] "Manufacturing/Biological total ratio: 16 / 5"
print("Manufacturing/Biological top-10 ratio: 3/2")
[1] "Manufacturing/Biological top-10 ratio: 3/2"
Welly, welly, welly: among the top 10 there are 1.5 times more manufacturing predictors than biological ones, yet in the data as a whole there are roughly 4 times as many manufacturing predictors as biological predictors (45 vs. 12). Neither group dominates outright, but one could certainly say that the biological variables are proportionately more significant.
How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
Linear | Non-Linear | Cubist |
---|---|---|
ManufacturingProcess32 = 100.00 | ManufacturingProcess32 = 100.00 | ManufacturingProcess32 = 100.000000 |
ManufacturingProcess04 = 84.21 | BiologicalMaterial06 = 80.33 | ManufacturingProcess09 = 71.590909 |
ManufacturingProcess28 = 80.46 | BiologicalMaterial02 = 72.54 | BiologicalMaterial06 = 50.000000 |
ManufacturingProcess43 = 64.85 | ManufacturingProcess13 = 70.64 | ManufacturingProcess17 = 40.909091 |
ManufacturingProcess33 = 62.49 | ManufacturingProcess36 = 60.52 | BiologicalMaterial10 = 40.909091 |
ManufacturingProcess12 = 57.85 | BiologicalMaterial03 = 60.25 | ManufacturingProcess33 = 36.363636 |
ManufacturingProcess35 = 38.26 | ManufacturingProcess31 = 58.55 | ManufacturingProcess13 = 36.363636 |
ManufacturingProcess37 = 38.03 | ManufacturingProcess09 = 54.30 | BiologicalMaterial04 = 32.954545 |
BiologicalMaterial11 = 37.20 | BiologicalMaterial04 = 53.08 | ManufacturingProcess31 = 29.545455 |
ManufacturingProcess10 = 36.51 | BiologicalMaterial12 = 51.95 | ManufacturingProcess04 = 29.545455 |
All three have identified ManufacturingProcess32 as the most important predictor. From there the rankings vary, with surprisingly little overlap. Somewhat interestingly, the drop-off from the first to the second predictor is considerably steeper for Cubist than for the linear and nonlinear models.
8.7.c
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
Oh dear, I just realized that 8.7 is asking us to use "tree-based" models; up to this question, I had not registered that. I looked for a non-Java alternative to M5 but didn't find one. My first thought was to make this plot with our boosted algorithm (the runner-up), but a boosted model is not a single tree. The only two single-tree algorithms I have are CART and M5, and since M5 beat CART I tried M5 first; however, I cannot figure out how to plot the distribution of yield in the terminal nodes of an M5 model. This question is not going so well. So, although it doesn't totally fulfill the request to plot the optimal single tree, I am going to work with CART, since I know that rpart.plot will plot the yield.
rpart.plot(model_cart$finalModel, main = "CART Tree")
That's very surprising. The first split is on ManufacturingProcess32, and it is the only split? This doesn't seem like it can be right, but I don't see anything wrong. If we look at the variable importance for CART:
varImp(model_cart)$importance |>
as.data.frame() |>
filter(Overall > 0) |>
arrange(desc(Overall)) |>
print()
Overall
BiologicalMaterial03 100.00000
BiologicalMaterial06 95.81838
ManufacturingProcess32 87.00209
BiologicalMaterial12 55.15929
ManufacturingProcess36 52.02182
BiologicalMaterial02 51.14444
BiologicalMaterial08 46.10175
BiologicalMaterial11 44.85281
We see that ManufacturingProcess32 is among the most important predictors, so it is not a shock that it is the variable chosen for the split (rpart's importance also credits surrogate splits, which is how the correlated biological materials can score highly without appearing in the pruned tree). But there are several other variables with substantial importance, so a single split is still surprising; let's assume it is correct. The distribution of yield in the terminal nodes is indicated in the plot, roughly a 60/40 split of the samples between the two leaves. No, it doesn't provide much additional knowledge about the biological or process predictors and their relationship with yield.
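For completeness, here is a sketch of one way to actually show the distribution of yield in the terminal nodes: rpart stores, in $where, the leaf that each training sample falls into, so we can box-plot the (knnImpute-standardized) yield by leaf.
# Map each training sample to its terminal node and plot the yield within each leaf.
leaf <- factor(model_cart$finalModel$where)
boxplot(train_y ~ leaf,
        xlab = "Terminal node (row of the rpart frame)",
        ylab = "Yield (standardized)",
        main = "Yield distribution by CART terminal node")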