Exercise 8.1

In this exercise we work with the simulated dataset from Exercise 7.2. We use the mlbench.friedman1 function from the mlbench library to simulate data for ten predictor variables \(V_1\) through \(V_{10}\) and one response variable \(y\) from the following nonlinear equation:

\[y = 10 \sin (\pi V_1 V_2) + 20 (V_3 -0.5)^2 + 10 V_4 + 5 V_5 + \epsilon\]

where the \(V_j\) are random variables uniformly distributed on [0, 1] and the error term \(\epsilon \sim N(0, \sigma^2)\) is normally distributed. Note that only the first five \(V_j\) variables enter the equation for the response \(y\); the remaining \(V_j\) variables are non-informative / noise variables. Our sample size is 200.

library(mlbench)  
set.seed(200)  
simulated <- mlbench.friedman1(200, sd = 1)  
simulated <- cbind(simulated$x, simulated$y)  
simulated <- as.data.frame(simulated)  
colnames(simulated)[ncol(simulated)] <- "y" 
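
The remainder of this report also relies on several other libraries, presumably loaded in a setup chunk that is not shown. Based on the functions called below, a minimal set would be:

library(randomForest)               # randomForest, varImpPlot
library(party)                      # cforest, varimp, ctree
library(partykit)                   # as.party for tree diagrams
library(gbm)                        # boosted trees
library(Cubist)                     # Cubist rule-based models
library(rpart)                      # single regression trees
library(RWeka)                      # M5 model trees
library(caret)                      # train, preProcess, varImp, resamples
library(AppliedPredictiveModeling)  # ChemicalManufacturingProcess data
library(GGally)                     # ggpairs
library(knitr)                      # kable
library(dplyr)                      # %>% and data manipulation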

(a) Base random forest model

For the first model, we fit a random forest model to the data using the randomForest function from the library of the same name. We use the following parameter values:

  • mtry (number of randomly sampled variables considered at each split): default value, which in this case is 3
  • ntree (number of trees to grow): 1000.

The variable importance scores are output below. We see that the most important variables in the random forest model are V1, V4, V2, and V5; the informative variable V3 receives a surprisingly low score, and none of the non-informative predictors (V6 through V10) have meaningful importance.

set.seed(1012)
model1 <- randomForest(y ~ ., data = simulated,  
                       importance = TRUE,  
                       ntree = 1000, mtry = 3)  
model1
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000, mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.592795
##                     % Var explained: 72.96
plot(model1, main = "Error profile of random forest model1")

rfImp1 <- varImp(model1, scale = FALSE) 
rfImp1 %>% kable(digits = 3, 
                 caption = "Variable importance scores for random forest model")
Variable importance scores for random forest model
Variable Overall
V1 8.843
V2 6.643
V3 0.670
V4 7.707
V5 2.225
V6 0.215
V7 -0.004
V8 -0.091
V9 0.031
V10 -0.015
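
As a quick visual check (not part of the exercise), the randomForest package's built-in varImpPlot function displays essentially the same ranking graphically; a minimal sketch using the fitted model1:

varImpPlot(model1, scale = FALSE,
           main = "Variable importance: random forest model1")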

(b) Additional correlated predictors

Next we add a predictor duplicate1 that is highly correlated with V1:

\[\text{duplicate1} = V_1 + \nu\]

where \(\nu \sim N(0, 0.1^2)\) is normally distributed noise with mean 0 and standard deviation 0.1. We fit another random forest model to the data, this time including the duplicate1 variable, using the same parameters used previously.

The variable importance scores are shown below. We see that the importance score for V1 has changed, and in fact has been diluted by the effect of the correlated predictor. In the first RF model, the importance score for V1 is ~ 8.8, whereas in the second RF model the influence of V1 has been split between V1 (score ~ 5.8) and duplicate1 (score ~ 3.9).

set.seed(200)  
simulated$duplicate1 <- simulated$V1 + rnorm(200) * 0.1  
cat("Correlation between duplicate1 and V1:", cor(simulated$duplicate1, simulated$V1)) 
## Correlation between duplicate1 and V1: 0.9497025
set.seed(1012)
model2 <- randomForest(y ~ ., data = simulated,  
                       importance = TRUE,  
                       ntree = 1000, mtry = 3)  
model2
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000, mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 6.821096
##                     % Var explained: 72.03
rfImp2 <- varImp(model2, scale = FALSE) 
rfImp <- merge(rfImp1, rfImp2, by = "row.names", all = TRUE, sort = FALSE)
colnames(rfImp) <- c("Variable", "RF", "RF_dup")
rfImp %>% kable(digits = 3, 
                caption = "Variable importance scores for random forest model")
Variable importance scores for random forest model
Variable RF RF_dup
V1 8.843 5.830
V2 6.643 6.043
V3 0.670 0.639
V4 7.707 6.985
V5 2.225 2.154
V6 0.215 0.164
V7 -0.004 -0.055
V8 -0.091 -0.010
V9 0.031 -0.070
V10 -0.015 -0.011
duplicate1 NA 3.935
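
A rough way to quantify the dilution is to compare V1's original score with the combined score of the correlated pair in the second model; a sketch using the rfImp table built above:

# combined importance of V1 + duplicate1 in model2, vs. V1 alone in model1
sum(rfImp$RF_dup[rfImp$Variable %in% c("V1", "duplicate1")])   # ~ 9.8, vs. ~ 8.8 for V1 in model1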

We expect that adding another predictor that is highly correlated with V1 will dilute the importance score even more. We illustrate by adding the predictor duplicate2:

\[\text{duplicate2} = \sin(2 \pi V_1) + \nu\]

where \(\nu \sim N(0, 0.1^2)\). We see that adding duplicate2 has diluted the importance scores of both V1 and duplicate1.

set.seed(200)  
simulated$duplicate2 <- sin(2 * pi * simulated$V1) + rnorm(200) * 0.1  
cat("Correlation between duplicate2 and V1:", 
    cor(simulated$duplicate2, simulated$V1)) 
## Correlation between duplicate2 and V1: -0.7838157
cat("Correlation between duplicate2 and duplicate1:", 
    cor(simulated$duplicate2, simulated$duplicate1)) 
## Correlation between duplicate2 and duplicate1: -0.7066676
set.seed(1012)
model3 <- randomForest(y ~ ., data = simulated,  
                       importance = TRUE,  
                       ntree = 1000, mtry = 3)  
model3
## 
## Call:
##  randomForest(formula = y ~ ., data = simulated, importance = TRUE,      ntree = 1000, mtry = 3) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 7.329723
##                     % Var explained: 69.94
rfImp3 <- varImp(model3, scale = FALSE) 
rfImp <- merge(rfImp, rfImp3, by.x = "Variable", by.y = "row.names", 
               all = TRUE, sort = FALSE)
colnames(rfImp)[ncol(rfImp)] <- c("RF_dup2")
rfImp %>% kable(digits = 3, 
                caption = "Variable importance scores for random forest model")
Variable importance scores for random forest model
Variable RF RF_dup RF_dup2
V1 8.843 5.830 4.896
V2 6.643 6.043 5.750
V3 0.670 0.639 0.545
V4 7.707 6.985 6.538
V5 2.225 2.154 2.065
V6 0.215 0.164 0.189
V7 -0.004 -0.055 -0.101
V8 -0.091 -0.010 -0.045
V9 0.031 -0.070 0.036
V10 -0.015 -0.011 -0.002
duplicate1 NA 3.935 3.385
duplicate2 NA NA 1.888

(c) Random forest model with conditional inference trees

Now we fit a random forest model with conditional inference trees to the data, using the cforest function from the party library. We use the controls argument to specify the same parameters used above for the base RF model:

  • mtry: 3
  • ntree: 1000.

We summarize the importance scores below, showing both the traditional and the conditional scores, denoted by “c”. We follow the same framework used above for the base RF model, in which we start with the 10 predictors, then add the first correlated variable (duplicate1), and then add the second correlated variable (duplicate2). Whether viewing the traditional or conditional importance scores, we see the same pattern observed for the base RF model:

  • Adding duplicate1 dilutes the importance score for V1
  • Adding duplicate2 dilutes the importance scores for V1 and duplicate1.
# first model without correlated variables
set.seed(1012)
cfor1 <- cforest(y ~ ., data = simulated[- (12:13)],  # exclude duplicate1 & duplicate2
                 controls = cforest_unbiased(ntree = 1000, mtry = 3))
cfor1
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10 
## Number of observations:  200
cforImp1 <- varimp(cfor1, conditional = FALSE)
cforImp1c <- varimp(cfor1, conditional = TRUE)
cforImp <- cbind(cforImp1, cforImp1c)

# include duplicate1
set.seed(1012)
cfor2 <- cforest(y ~ ., data = simulated[- 13],   # exclude duplicate2
                 controls = cforest_unbiased(ntree = 1000, mtry = 3))
cfor2
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, duplicate1 
## Number of observations:  200
cforImp2 <- varimp(cfor2, conditional = FALSE)
cforImp2c <- varimp(cfor2, conditional = TRUE)
cforImp <- merge(cforImp, cbind(cforImp2, cforImp2c), by = "row.names", 
                 all = TRUE, sort = FALSE)

# include duplicate2
set.seed(1012)
cfor3 <- cforest(y ~ ., data = simulated, 
                 controls = cforest_unbiased(ntree = 1000, mtry = 3))
cfor3
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  1000 
## 
## Response:  y 
## Inputs:  V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, duplicate1, duplicate2 
## Number of observations:  200
cforImp3 <- varimp(cfor3, conditional = FALSE)
cforImp3c <- varimp(cfor3, conditional = TRUE)
cforImp <- merge(cforImp, cbind(cforImp3, cforImp3c), 
                 by.x = "Row.names", by.y = "row.names", 
                 all = TRUE, sort = FALSE)

# compile importance scores
colnames(cforImp) <- c("Variable", 
                       paste0("CFOR", c("", "c", "_dup", "c_dup", "_dup2", "c_dup2"))) 
cforImp %>% kable(digits = 3, 
                  caption = "Variable importance scores for conditional random forest model")
Variable importance scores for conditional random forest model
Variable CFOR CFORc CFOR_dup CFORc_dup CFOR_dup2 CFORc_dup2
V1 7.586 4.454 5.387 2.366 4.756 1.869
V2 5.396 4.103 5.077 3.620 4.594 3.356
V3 0.132 0.121 0.053 0.038 0.087 0.080
V4 6.692 5.012 5.802 4.609 5.717 4.295
V5 1.928 1.124 1.685 1.145 1.639 0.950
V6 -0.056 -0.022 -0.001 0.013 0.028 0.015
V7 0.027 -0.037 -0.039 -0.026 -0.030 -0.005
V8 0.003 0.006 -0.051 -0.007 -0.039 -0.010
V9 -0.047 0.003 -0.020 -0.011 -0.016 -0.017
V10 -0.101 -0.008 -0.035 -0.027 -0.070 -0.014
duplicate1 NA NA 3.509 1.734 2.558 1.119
duplicate2 NA NA NA NA 1.701 0.328

(d) Other tree models

Next we repeat the same process with the boosted trees and Cubist models.

For the boosted trees model, we use the gbm function from the library of the same name. We specify the same number of trees used previously for the base RF and CFOR models, with other parameters set to their default values:

  • n.trees: 1000
  • interaction.depth: 1
  • shrinkage: 0.1
  • bag.fraction: 0.5.

From the table of variable importance scores below, we see the same pattern that was observed previously for the base RF and conditional RF models:

  • Adding duplicate1 dilutes the importance score for V1
  • Adding duplicate2 dilutes the importance scores for V1 and duplicate1.
### boosted trees model
# first model without correlated variables
set.seed(1012)
boost1 <- gbm(y ~ ., data = simulated[- (12:13)],  # exclude duplicate1 & duplicate2
              distribution = "gaussian", 
              n.trees = 1000)
boost1
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated[-(12:13)], 
##     n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 10 predictors of which 10 had non-zero influence.
boostImp1 <- summary(boost1, plotit = FALSE, order = FALSE)

# include duplicate1
set.seed(1012)
boost2 <- gbm(y ~ ., data = simulated[- 13],   # exclude duplicate2
              distribution = "gaussian", 
              n.trees = 1000)
boost2
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated[-13], 
##     n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 11 predictors of which 11 had non-zero influence.
boostImp2 <- summary(boost2, plotit = FALSE, order = FALSE)
boostImp <- merge(boostImp1, boostImp2, by = "var", 
                  all = TRUE, sort = FALSE)

# include duplicate2
set.seed(1012)
boost3 <- gbm(y ~ ., data = simulated,
              distribution = "gaussian", 
              n.trees = 1000)
boost3
## gbm(formula = y ~ ., distribution = "gaussian", data = simulated, 
##     n.trees = 1000)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 12 predictors of which 12 had non-zero influence.
boostImp3 <- summary(boost3, plotit = FALSE, order = FALSE)
boostImp <- merge(boostImp, boostImp3, by = "var", 
                  all = TRUE, sort = FALSE)

# compile importance scores
colnames(boostImp) <- c("Variable", 
                        paste0("BOOST", c("", "_dup", "_dup2"))) 
boostImp %>% kable(digits = 3, 
                   caption = "Variable importance scores for boosted trees model")
Variable importance scores for boosted trees model
Variable BOOST BOOST_dup BOOST_dup2
V1 22.335 17.547 16.978
V2 19.978 19.802 19.371
V3 10.048 10.107 10.176
V4 25.167 24.933 24.850
V5 11.465 10.971 10.748
V6 2.501 2.459 2.145
V7 3.147 2.763 2.404
V8 1.514 1.361 1.474
V9 1.862 1.954 1.744
V10 1.983 1.756 1.635
duplicate1 NA 6.347 6.229
duplicate2 NA NA 2.246
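
For reference, the scores tabulated above can also be pulled directly from gbm via relative.influence, which summary.gbm normalizes to sum to 100; a sketch for the third model:

# raw relative influence, rescaled to percentages as summary.gbm does
ri <- relative.influence(boost3, n.trees = 1000)
round(100 * ri / sum(ri), 3)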

For the Cubist model, we use the cubist function from the Cubist library. We set the number of committees and leave all other parameters at their default values:

  • committees: 100
  • neighbors: not used for fitting (this is a prediction-time adjustment).

We see that the effect of adding the correlated predictors to the Cubist model is slightly different compared to the previous models:

  • First, the baseline model gives a borderline importance score to one non-informative variable (V6); its score is almost 40% of that of the weakest informative variable (V5)
  • Adding duplicate1 has almost no effect on the importance scores for V1, the other informative predictors, or the borderline non-informative variable (V6); however, it causes a slight increase in the score for another non-informative predictor (V8). Interestingly, the importance score for duplicate1 itself is essentially zero.
  • Adding duplicate2 dilutes the importance score for V1 and also has a slight impact on the other informative variables. Again of interest, duplicate2 receives a higher importance score than duplicate1 even though it is more weakly correlated with V1.

The relationship between adding the correlated predictors and their importance scores (as well as their impact on the importance scores of the other predictors) seems unusual, and is worth further investigation.

### cubist model
# first model without correlated variables
set.seed(1012)
cub1 <- cubist(x = simulated[- (11:13)],  # exclude y, duplicate1 & duplicate2
               y = simulated$y, 
               committees = 100)
cub1
## 
## Call:
## cubist.default(x = simulated[-(11:13)], y = simulated$y, committees = 100)
## 
## Number of samples: 200 
## Number of predictors: 10 
## 
## Number of committees: 100 
## Number of rules per committee: 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4 ...
cubImp1 <- varImp(cub1)

# include duplicate1
set.seed(1012)
cub2 <- cubist(x = simulated[- c(11, 13)],  # exclude y & duplicate2
               y = simulated$y, 
               committees = 100)
cub2
## 
## Call:
## cubist.default(x = simulated[-c(11, 13)], y = simulated$y, committees
##  = 100)
## 
## Number of samples: 200 
## Number of predictors: 11 
## 
## Number of committees: 100 
## Number of rules per committee: 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 2, 4 ...
cubImp2 <- varImp(cub2)
cubImp <- merge(cubImp1, cubImp2, by = "row.names", 
                  all = TRUE, sort = FALSE)

# include duplicate2
set.seed(1012)
cub3 <- cubist(x = simulated[- 11],  # exclude y (the response is passed separately)
               y = simulated$y, 
               committees = 100)
cub3
## 
## Call:
## cubist.default(x = simulated[-11], y = simulated$y, committees = 100)
## 
## Number of samples: 200 
## Number of predictors: 12 
## 
## Number of committees: 100 
## Number of rules per committee: 1, 4, 1, 5, 1, 1, 4, 5, 5, 1, 4, 2, 4, 2, 6, 1, 4, 1, 4, 1 ...
cubImp3 <- varImp(cub3)
cubImp <- merge(cubImp, cubImp3, 
                by.x = "Row.names", by.y = "row.names", 
                  all = TRUE, sort = FALSE)

# compile importance scores
colnames(cubImp) <- c("Variable", 
                      paste0("CUBIST", c("", "_dup", "_dup2"))) 
cubImp %>% kable(digits = 3, 
                 caption = "Variable importance scores for Cubist model")
Variable importance scores for Cubist model
Variable CUBIST CUBIST_dup CUBIST_dup2
V1 71.5 71.0 65.0
V3 47.0 46.5 49.0
V2 58.5 59.0 62.0
V4 48.0 48.0 46.0
V5 33.0 32.5 30.5
V6 13.0 12.0 12.0
V7 0.0 0.0 0.0
V8 0.0 1.0 4.0
V9 0.0 0.0 0.0
V10 0.0 0.0 0.0
duplicate1 NA 0.0 1.0
duplicate2 NA NA 5.5
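
One place to start such an investigation is Cubist's own attribute-usage statistics (the percentage of rules that use each variable in their conditions and in their linear models), which caret's varImp for Cubist summarizes. Assuming the fitted object stores this table in its usage component, as recent versions of the Cubist package do:

# attribute usage for the model with both correlated predictors (assumed structure)
cub3$usage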

Exercise 8.2

We start by simulating several random variables with a sample size of 500:

  • \(x \sim U(0, 1)\): informative variable uniformly distributed on [0, 1]
  • \(z \sim N(0, 4)\): noise variable normally distributed with mean 0 and standard deviation 2
  • \(e \sim N(0, 16)\): error variable normally distributed with mean 0 and standard deviation 4

Then we define the response variable \(y\) to be related to the informative variable \(x\) through a sine function with error term:

\[y = \sin(2 \pi x) + e\]

which is plotted below.

# random variables
set.seed(227)
x <- runif(500)            # informative variable
z <- rnorm(500, sd = 2)    # noise variable
e <- rnorm(500, sd = 4)    # error

# response = function of x + error
f <- function(x) sin(2 * pi * x) + e
y <- f(x)
plot(x, y, main = "y = sin(2*pi*x) + e")
lines(sort(x), sin(2 * pi * sort(x)), col = 'red', lwd = 2)

Next we consider different granularity levels for the informative variable by rounding the x-values into different buckets, which is equivalent to partitioning the interval [0, 1] into different-sized increments (i.e., nearest increments of 1, 1/2, 1/3, etc.):

  • \(x_1 = \text{round}(x)\)
  • \(x_2 = \text{round}(2 x) / 2\)
  • \(x_3 = \text{round}(3 x) / 3\)
  • \(x_n = \text{round}(n x) / n\)

We can see from the pairs plot below that \(x_1\) rounds to binary data (0’s and 1’s), \(x_3\) rounds to the nearest third, and so on; by the time we get to \(x_{10}\), the distribution almost resembles the continuous distribution for \(x\).

# different granularity of informative variable
x1 <- round(x)
x2 <- round(2*x) / 2
x3 <- round(3*x) / 3
x5 <- round(5*x) / 5
x10 <- round(10*x) / 10
x20 <- round(20*x) / 20
x50 <- round(50*x) / 50
x100 <- round(100*x) / 100
xx <- data.frame(x1, x2, x3, x5, x10, x20, x50, x100, x)
# plot distributions
ggpairs(xx[c(1,3,4,5,9)]) + labs(title = "Pairs plot of selected x_j")

Now for different levels of granularity for \(x_j\), we do the following:

  • Generate the response \(y\) as a function of \(x_j\) using the formula above
  • Fit a single regression tree model for \(y\) on \(x_j\) and \(z\) (using the rpart function)
  • Record the variable importance scores for \(x_j\) and \(z\).

The results are shown below. We see that for \(x_1\) through \(x_5\), the variable importance scores are higher for the noise variable \(z\) than for the informative variable \(x_j\), whereas for \(x_{10}\) through \(x\), the pattern reverses. In fact, when \(x\) is rounded to binary values (\(x_1\)) or to the nearest 1/2 (\(x_2\)), the tree splits almost entirely on the noise variable \(z\). This illustrates that when the informative variable has low granularity (i.e., few distinct values), the regression tree is biased toward selecting the continuous noise variable over the informative variable.

imp <- matrix(nrow = 2, ncol = ncol(xx))
colnames(imp) <- colnames(xx)
rownames(imp) <- c("x", "z")
for (j in (1:ncol(xx))) {
    fit <- rpart(f(xx[[j]]) ~ xx[[j]] + z)
    imp[ , j] <- as.matrix(varImp(fit))
}
imp %>% kable(digits = 3, 
              caption = "Variable importance scores for different granularity levels of x")
Variable importance scores for different granularity levels of x
x1 x2 x3 x5 x10 x20 x50 x100 x
x 0.000 0.001 0.033 0.034 0.032 0.052 0.061 0.285 0.116
z 0.022 0.022 0.049 0.047 0.011 0.040 0.048 0.095 0.097
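
As a quick check of the granularity argument, we can count the number of distinct values at each rounding level:

# number of distinct values of each rounded version of x (500 observations)
sapply(xx, function(v) length(unique(v)))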

Below we show the tree diagram for the case where \(x\) is rounded to the nearest 1/2 (\(x_2\)); in this instance, the tree splits are specified in terms of only the noise variable \(z\).

plot(as.party(rpart(f(x2) ~ x2 + z)), main = "Tree diagram for x2")

Exercise 8.3

First let’s define terms, within the stochastic gradient boosting context:

  • Interaction depth = depth of the regression tree that is fitted at each iteration step. Lower values correspond to weaker / less complex learners, which are typically more effective for constructing the boosting model.
  • Bagging fraction = fraction (from 0 to 1) of the training set that is randomly sampled for model fitting at each iteration step. Lower values of the bagging fraction help to reduce prediction variance, thereby improving prediction accuracy.
  • Learning rate = multiplier (from 0 to 1) applied to each new tree's predictions before they are added to the running model. Lower values of the learning rate act as a form of regularization, which helps mitigate the tendency to over-fit the training data. A sketch illustrating the latter two parameters follows this list.
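
As an illustration of the bagging fraction and learning rate (using the simulated data from Exercise 8.1 rather than the textbook's solubility example, and assuming gbm is still loaded), the sketch below fits two boosted models with low versus high values of both parameters; the relative influence should be noticeably more concentrated in the top predictors for the second model. The smaller n.minobsinnode is only there so that a 10% bag of 200 observations is still splittable.

# low bagging fraction & learning rate vs. high values of both (illustrative sketch)
set.seed(1012)
gbm_low  <- gbm(y ~ ., data = simulated[1:11], distribution = "gaussian",
                n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                n.minobsinnode = 5)
gbm_high <- gbm(y ~ ., data = simulated[1:11], distribution = "gaussian",
                n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                n.minobsinnode = 5)
summary(gbm_low, plotit = FALSE)    # influence spread across more predictors
summary(gbm_high, plotit = FALSE)   # influence concentrated in the top predictors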

(a) Variable importance

The model on the right has a higher bagging fraction (0.9) and higher learning rate (0.9) than the model on the left (where both parameters equal 0.1). This means the model on the right is more likely to over-fit the training set: at each iteration, the tree is fit on 90% of the training data, and 90% of that tree's contribution is carried into the next iteration, so the same few strong predictors are chosen early and reinforced at every step. Combined, these factors suggest that variable importance will be more concentrated in the second model (on the right) than in the first model (on the left).

(b) Predictive performance

For the same reasons given above, the model on the left (where both the bagging fraction and the learning rate equal 0.1) will most likely generalize better and have stronger predictive performance for new samples.

(c) Interaction depth

Increasing the interaction depth will most likely flatten the slope of the variable importance profile for both models. The reason is that increasing the interaction depth produces stronger / more complex learners at each iteration step, which include more variables in each tree. As a result, the final model will likely spread its influence across a greater diversity of variables, so that importance scores are less concentrated in the top predictors.
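
A companion sketch for the interaction depth, again using the Exercise 8.1 data with gbm defaults otherwise: deeper trees typically spread the relative influence across more predictors.

set.seed(1012)
gbm_d1 <- gbm(y ~ ., data = simulated[1:11], distribution = "gaussian",
              n.trees = 500, interaction.depth = 1)
gbm_d7 <- gbm(y ~ ., data = simulated[1:11], distribution = "gaussian",
              n.trees = 500, interaction.depth = 7)
summary(gbm_d1, plotit = FALSE)   # stumps: influence concentrated in top predictors
summary(gbm_d7, plotit = FALSE)   # deeper trees: influence spread more widely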

Exercise 8.7

In this exercise we work with the ChemicalManufacturingProcess dataset (from the AppliedPredictiveModeling library), fitting and tuning models with the caret library. We apply the same imputation, data splitting, and pre-processing steps that we used in Exercise 6.3 (Homework 7) and Exercise 7.5 (Homework 8):

  • Imputation for missing values: bagged tree method using the preProcess function
  • Data splitting: 75/25 partition into training and test sets using the createDataPartition function
  • Pre-processing: remove near-zero variance predictors (only 1 variable) and center and scale all remaining predictors.

We then fit and evaluate seven models using the following algorithms and tuning parameters:

  • Regression tree (CART): “rpart” method from the rpart library; tune cp (complexity) parameter
  • Conditional inference tree (CIT): “ctree” method from the party library; tune mincriterion (1 - P-value threshold) parameter
  • Model tree (M5): “M5” method from the RWeka package; tune pruned, smoothed, and rules parameters while holding M (minimum sample size to split) parameter equal to 10
  • Bagged regression tree (BAG): “treebag” method from the ipred library; no parameter tuning with 50 bootstrap samples
  • Random forest (RF): “rf” method from the randomForest library; tune mtry (number of randomly sampled predictors for each split) while holding ntree (number of trees) parameter equal to 500
  • Stochastic gradient boosted trees (SGB): “gbm” method from the gbm library; tune n.trees (number of boosting iterations), interaction.depth (maximum tree depth), and shrinkage parameters while holding n.minobsinnode (minimum terminal node size) parameter equal to 10
  • Cubist (CUB): “cubist” method from the Cubist library; tune committees (number of committees) and neighbors (number of nearest instances) parameters.

We specify a common resampling and tuning methodology for all models within the train function:

  • Resampling method: 10-fold cross-validation
  • Tuning method: tuning parameters are chosen to minimize RMSE on the holdout samples.

For each model, we show below the re-sampling output, the tuning profile, the final model summary, and the variable importance plot.

### load data
data("ChemicalManufacturingProcess")
#str(ChemicalManufacturingProcess)
# for ease of manipulation, split response and predictors & relabel columns
yield <- ChemicalManufacturingProcess[1]
cmp <- ChemicalManufacturingProcess[-1]
colnames(cmp) <- c(paste0("B", 1:12), paste0("M", 1:45)) 
#summary(yield)
#summary(cmp)
yield <- yield[[1]]     # numeric
cmp <- as.matrix(cmp)   # matrix

### impute missing values
set.seed(502)
impute <- preProcess(cmp, method = "bagImpute")
cmp <- predict(impute, cmp)
#summary(cmp)

### training/test split
set.seed(314)
train_idx <- createDataPartition(yield, p = 0.75, list = FALSE)
yield_train <- yield[train_idx]
yield_test <- yield[-train_idx]
cmp_train <- cmp[train_idx, ]
cmp_test <- cmp[-train_idx, ]
#summary(yield_train)
#summary(yield_test)

### pre-processing and resampling method
#prep <- c("BoxCox", "center", "scale", "nzv")  <<< BOX-COX CAUSES NaN / Inf ERRORS
prep <- c("center", "scale", "nzv")
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  <<< TAKES TOO LONG
ctrl <- trainControl(method = "cv", number = 10)

##############################################################################
# fit rpart model; tune complexity
##############################################################################
# fit & tune model
set.seed(1012)
rpartModel <- train(x = cmp_train, 
                    y = yield_train, 
                    preProcess = prep, 
                    trControl = ctrl, 
                    method = "rpart", 
                    tuneLength = 15)
rpartModel
## CART 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.00000000  1.644582  0.3969930  1.286968
##   0.02666366  1.642595  0.4024115  1.280128
##   0.05332731  1.558834  0.4224409  1.211878
##   0.07999097  1.591814  0.3893508  1.254702
##   0.10665463  1.595145  0.3481610  1.268602
##   0.13331829  1.628173  0.3350098  1.282216
##   0.15998194  1.629326  0.3338690  1.295754
##   0.18664560  1.629326  0.3338690  1.295754
##   0.21330926  1.629326  0.3338690  1.295754
##   0.23997291  1.629326  0.3338690  1.295754
##   0.26663657  1.629326  0.3338690  1.295754
##   0.29330023  1.629326  0.3338690  1.295754
##   0.31996389  1.629326  0.3338690  1.295754
##   0.34662754  1.704395  0.2909449  1.364751
##   0.37329120  1.842859  0.1814291  1.465252
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.05332731.
ggplot(rpartModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", rpartModel$modelInfo$label))

# final model & variable importance
plot(as.party(rpartModel$finalModel), 
     main = paste0("Tree diagram: ", rpartModel$modelInfo$label))

ggplot(varImp(rpartModel), top = 20) + 
  labs(title = paste0("Variable importance: ", rpartModel$modelInfo$label))

# test set predictions
rpartPred <- predict(rpartModel, newdata = cmp_test)  
rpartPerf <- postResample(pred = rpartPred, obs = yield_test) 

##############################################################################
# fit ctree model; tune mincriterion
##############################################################################
# fit & tune model
set.seed(1012)
ctreeModel <- train(x = cmp_train, 
                    y = yield_train, 
                    preProcess = prep, 
                    trControl = ctrl, 
                    method = "ctree", 
                    tuneLength = 15)
ctreeModel
## Conditional Inference Tree 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  RMSE      Rsquared   MAE     
##   0.01          1.520954  0.4542141  1.233950
##   0.08          1.516247  0.4583815  1.231029
##   0.15          1.518616  0.4553296  1.221299
##   0.22          1.536211  0.4436224  1.233415
##   0.29          1.526104  0.4370691  1.227696
##   0.36          1.549873  0.4229506  1.240391
##   0.43          1.561818  0.4088292  1.255816
##   0.50          1.561818  0.4088292  1.255816
##   0.57          1.569959  0.3981830  1.262282
##   0.64          1.559189  0.4067986  1.248650
##   0.71          1.571693  0.3962898  1.263112
##   0.78          1.575295  0.3919013  1.270514
##   0.85          1.631148  0.3515523  1.331550
##   0.92          1.624372  0.3538730  1.320035
##   0.99          1.647913  0.3331788  1.328366
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.08.
ggplot(ctreeModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", ctreeModel$modelInfo$label))

# final model & variable importance
plot(ctreeModel$finalModel, 
     main = paste0("Tree diagram: ", ctreeModel$modelInfo$label))

ggplot(varImp(ctreeModel), top = 20) + 
  labs(title = paste0("Variable importance: ", ctreeModel$modelInfo$label))

# test set predictions
ctreePred <- predict(ctreeModel, newdata = cmp_test)  
ctreePerf <- postResample(pred = ctreePred, obs = yield_test) 

##############################################################################
# fit m5 model; tune pruned, smoothed & rules
##############################################################################
# fit & tune model
set.seed(1012)
m5Model <- train(x = cmp_train, 
                 y = yield_train, 
                 preProcess = prep, 
                 trControl = ctrl, 
                 method = "M5", 
                 control = Weka_control(M = 10))
m5Model
## Model Tree 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   pruned  smoothed  rules  RMSE      Rsquared   MAE     
##   Yes     Yes       Yes    1.379771  0.5343513  1.104119
##   Yes     Yes       No     1.283890  0.5760348  1.074251
##   Yes     No        Yes    1.390627  0.5345356  1.109995
##   Yes     No        No     1.405642  0.5115142  1.147530
##   No      Yes       Yes    1.411207  0.5346730  1.114986
##   No      Yes       No     1.366413  0.5619962  1.108823
##   No      No        Yes    1.655949  0.4389937  1.287995
##   No      No        No     1.685049  0.3763196  1.286099
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were pruned = Yes, smoothed = Yes
##  and rules = No.
plot(m5Model, main = paste0("Tuning profile: ", m5Model$modelInfo$label))

# final model & variable importance
m5Model$finalModel
## M5 pruned model tree:
## (using smoothed linear models)
## 
## M32 <= 0.132 : 
## |   B12 <= -0.604 : LM1 (31/38.044%)
## |   B12 >  -0.604 : 
## |   |   M17 <= -0.27 : LM2 (18/50.366%)
## |   |   M17 >  -0.27 : LM3 (26/30.379%)
## M32 >  0.132 : 
## |   M6 <= 0.265 : 
## |   |   B11 <= -0.349 : 
## |   |   |   B3 <= -0.342 : LM4 (4/52.545%)
## |   |   |   B3 >  -0.342 : LM5 (8/42.649%)
## |   |   B11 >  -0.349 : 
## |   |   |   M31 <= -0.162 : LM6 (6/30.694%)
## |   |   |   M31 >  -0.162 : 
## |   |   |   |   M15 <= 0.107 : LM7 (10/20.291%)
## |   |   |   |   M15 >  0.107 : LM8 (6/10.246%)
## |   M6 >  0.265 : LM9 (23/47.538%)
## 
## LM num: 1
## .outcome = 
##  0.0508 * B3 
##  + 0.1542 * B9 
##  - 0.0376 * B11 
##  + 0.1444 * B12 
##  - 0.0752 * M2 
##  + 0.6592 * M6 
##  - 0.3502 * M17 
##  + 0.0626 * M30 
##  + 0.1609 * M32 
##  - 5.2393 * M39 
##  + 40.4592
## 
## LM num: 2
## .outcome = 
##  0.0508 * B3 
##  - 0.0376 * B11 
##  + 0.1231 * B12 
##  - 0.1564 * M2 
##  + 0.4254 * M6 
##  - 0.6111 * M17 
##  + 0.0626 * M30 
##  + 0.1609 * M32 
##  + 0.0377 * M39 
##  + 40.1661
## 
## LM num: 3
## .outcome = 
##  0.0508 * B3 
##  - 0.0376 * B11 
##  + 0.1231 * B12 
##  - 0.1373 * M2 
##  + 0.1254 * M6 
##  - 0.5497 * M17 
##  - 0.1513 * M22 
##  + 0.0626 * M30 
##  + 0.1609 * M32 
##  + 0.0377 * M39 
##  + 39.5935
## 
## LM num: 4
## .outcome = 
##  0.6197 * B3 
##  - 0.047 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  + 0.331 * M15 
##  - 2.0083 * M16 
##  - 0.2943 * M17 
##  + 0.0782 * M30 
##  - 0.159 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 40.8813
## 
## LM num: 5
## .outcome = 
##  0.5441 * B3 
##  - 0.047 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  + 0.331 * M15 
##  - 2.0083 * M16 
##  - 0.2943 * M17 
##  + 0.0782 * M30 
##  - 0.159 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 41.1235
## 
## LM num: 6
## .outcome = 
##  0.1848 * B3 
##  - 0.047 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  + 0.2415 * M15 
##  - 1.4655 * M16 
##  - 0.2943 * M17 
##  + 0.0782 * M30 
##  - 0.2472 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 40.9585
## 
## LM num: 7
## .outcome = 
##  0.1848 * B3 
##  - 0.047 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  + 0.1952 * M15 
##  - 1.2018 * M16 
##  - 0.2943 * M17 
##  + 0.0782 * M30 
##  - 0.2188 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 40.8325
## 
## LM num: 8
## .outcome = 
##  0.1848 * B3 
##  - 0.047 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  + 0.1864 * M15 
##  - 1.4655 * M16 
##  - 0.2943 * M17 
##  + 0.0782 * M30 
##  - 0.2188 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 40.7556
## 
## LM num: 9
## .outcome = 
##  0.2199 * B3 
##  + 0.1937 * B11 
##  + 0.0596 * B12 
##  + 0.0442 * M6 
##  - 0.7 * M17 
##  + 0.0782 * M30 
##  - 0.205 * M31 
##  + 0.2011 * M32 
##  + 0.0471 * M39 
##  + 41.3458
## 
## Number of Rules : 9
#plot(m5Model$finalModel, 
#     main = paste0("Tree diagram: ", m5Model$modelInfo$label))
# error with tree diagram so plot manually: pruned = Yes, smoothed = Yes
m5plot <- M5P(yield_train ~ ., data = data.frame(yield_train, cmp_train), 
              control = Weka_control(M = 10, N = FALSE, U = FALSE))
plot(m5plot, main = "Tree diagram: M5 (raw predictors)")

ggplot(varImp(m5Model), top = 20) + 
  labs(title = paste0("Variable importance: ", m5Model$modelInfo$label))

# test set predictions
m5Pred <- predict(m5Model, newdata = cmp_test)  
m5Perf <- postResample(pred = m5Pred, obs = yield_test) 

##############################################################################
# fit bagged tree model; no tuning
##############################################################################
# fit & tune model
set.seed(1012)
bagModel <- train(x = cmp_train, 
                  y = yield_train, 
                  preProcess = prep, 
                  trControl = ctrl, 
                  method = "treebag", 
                  nbagg = 50)
bagModel
## Bagged CART 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.274328  0.6045216  0.9850819
#ggplot(bagModel, highlight = TRUE) + 
#  labs(title = paste0("Tuning profile: ", bagModel$modelInfo$label))

# final model & variable importance
bagModel$finalModel 
## 
## Bagging regression trees with 50 bootstrap replications
ggplot(varImp(bagModel), top = 20) + 
  labs(title = paste0("Variable importance: ", bagModel$modelInfo$label))

# test set predictions
bagPred <- predict(bagModel, newdata = cmp_test)  
bagPerf <- postResample(pred = bagPred, obs = yield_test) 

##############################################################################
# fit random forest model; tune mtry
##############################################################################
# fit & tune model
rfGrid <- data.frame(mtry = seq(2, ncol(cmp) - 1, length = 10))
set.seed(1012)
rfModel <- train(x = cmp_train, 
                 y = yield_train, 
                 preProcess = prep, 
                 trControl = ctrl, 
                 method = "rf", importance = TRUE, ntree = 500,
                 tuneGrid = rfGrid)
rfModel
## Random Forest 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    1.256593  0.6753606  1.0280983
##    8    1.175725  0.6914582  0.9301040
##   14    1.162387  0.6859670  0.9118959
##   20    1.162259  0.6835265  0.9166543
##   26    1.163774  0.6805012  0.9122052
##   32    1.174409  0.6715481  0.9161034
##   38    1.183526  0.6647786  0.9201713
##   44    1.184037  0.6646573  0.9270198
##   50    1.193942  0.6577142  0.9323377
##   56    1.191157  0.6591370  0.9218105
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 20.
ggplot(rfModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", rfModel$modelInfo$label))

# final model & variable importance
plot(rfModel$finalModel, main = paste0("Error profile: ", rfModel$modelInfo$label))

ggplot(varImp(rfModel), top = 20) + 
  labs(title = paste0("Variable importance: ", rfModel$modelInfo$label))

# test set predictions
rfPred <- predict(rfModel, newdata = cmp_test)  
rfPerf <- postResample(pred = rfPred, obs = yield_test) 

##############################################################################
# fit boosted tree model; tune interaction.depth, n.trees & shrinkage
##############################################################################
# fit & tune model
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 100),
                       shrinkage = c(0.01, 0.1), n.minobsinnode = 10)
set.seed(1012)
gbmModel <- train(x = cmp_train, 
                  y = yield_train, 
                  preProcess = prep, 
                  trControl = ctrl, 
                  method = "gbm", verbose = FALSE, 
                  tuneGrid = gbmGrid)
gbmModel
## Stochastic Gradient Boosting 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE      
##   0.01       1                   100     1.498650  0.5901162  1.2186318
##   0.01       1                   200     1.321326  0.6312728  1.0622624
##   0.01       1                   300     1.246167  0.6478029  0.9895680
##   0.01       1                   400     1.208770  0.6628299  0.9533216
##   0.01       1                   500     1.192092  0.6684800  0.9388970
##   0.01       1                   600     1.181935  0.6710051  0.9288092
##   0.01       1                   700     1.168935  0.6759267  0.9194513
##   0.01       1                   800     1.156721  0.6819414  0.9051553
##   0.01       1                   900     1.151742  0.6836909  0.8981264
##   0.01       1                  1000     1.148110  0.6850797  0.8953117
##   0.01       3                   100     1.384707  0.6146350  1.1281035
##   0.01       3                   200     1.230898  0.6515550  0.9817653
##   0.01       3                   300     1.179918  0.6699248  0.9284973
##   0.01       3                   400     1.148254  0.6842008  0.9042542
##   0.01       3                   500     1.129518  0.6927197  0.8898761
##   0.01       3                   600     1.114233  0.6987633  0.8808828
##   0.01       3                   700     1.104542  0.7026256  0.8739396
##   0.01       3                   800     1.098531  0.7045647  0.8689153
##   0.01       3                   900     1.092963  0.7069225  0.8648685
##   0.01       3                  1000     1.089273  0.7083811  0.8656634
##   0.01       5                   100     1.375900  0.6263831  1.1195141
##   0.01       5                   200     1.240781  0.6477026  0.9917123
##   0.01       5                   300     1.182059  0.6672434  0.9369984
##   0.01       5                   400     1.151305  0.6816515  0.9085756
##   0.01       5                   500     1.127197  0.6917683  0.8954229
##   0.01       5                   600     1.115840  0.6958137  0.8863796
##   0.01       5                   700     1.104935  0.6996919  0.8773043
##   0.01       5                   800     1.097023  0.7033499  0.8716464
##   0.01       5                   900     1.091912  0.7065807  0.8682391
##   0.01       5                  1000     1.085987  0.7099831  0.8634081
##   0.01       7                   100     1.369840  0.6349881  1.1146840
##   0.01       7                   200     1.231200  0.6564487  0.9817977
##   0.01       7                   300     1.183379  0.6705162  0.9348219
##   0.01       7                   400     1.150104  0.6834322  0.9052179
##   0.01       7                   500     1.126963  0.6926420  0.8841507
##   0.01       7                   600     1.109488  0.7013380  0.8701628
##   0.01       7                   700     1.099563  0.7046437  0.8617788
##   0.01       7                   800     1.088393  0.7090501  0.8564584
##   0.01       7                   900     1.085948  0.7092663  0.8568966
##   0.01       7                  1000     1.080974  0.7117137  0.8536201
##   0.10       1                   100     1.151952  0.6723921  0.9107895
##   0.10       1                   200     1.139967  0.6800563  0.9062366
##   0.10       1                   300     1.152613  0.6690931  0.9350862
##   0.10       1                   400     1.152909  0.6669775  0.9494351
##   0.10       1                   500     1.149212  0.6694937  0.9490025
##   0.10       1                   600     1.160175  0.6671251  0.9497953
##   0.10       1                   700     1.166699  0.6643580  0.9552319
##   0.10       1                   800     1.167924  0.6637075  0.9548883
##   0.10       1                   900     1.168252  0.6633959  0.9564782
##   0.10       1                  1000     1.167384  0.6660688  0.9492150
##   0.10       3                   100     1.139742  0.6938809  0.9008562
##   0.10       3                   200     1.111269  0.6990786  0.8774888
##   0.10       3                   300     1.105495  0.6971149  0.8730109
##   0.10       3                   400     1.105066  0.6972192  0.8719661
##   0.10       3                   500     1.105318  0.6970177  0.8733287
##   0.10       3                   600     1.104597  0.6967659  0.8734629
##   0.10       3                   700     1.104572  0.6969158  0.8727718
##   0.10       3                   800     1.105370  0.6964025  0.8730812
##   0.10       3                   900     1.105140  0.6965465  0.8728403
##   0.10       3                  1000     1.105487  0.6962912  0.8728713
##   0.10       5                   100     1.136620  0.6841275  0.8997597
##   0.10       5                   200     1.138032  0.6802838  0.9018184
##   0.10       5                   300     1.137156  0.6815714  0.9050817
##   0.10       5                   400     1.137473  0.6806456  0.9055027
##   0.10       5                   500     1.138379  0.6799225  0.9058674
##   0.10       5                   600     1.138328  0.6798776  0.9066717
##   0.10       5                   700     1.138298  0.6801197  0.9070434
##   0.10       5                   800     1.138642  0.6799711  0.9074562
##   0.10       5                   900     1.139345  0.6796498  0.9081762
##   0.10       5                  1000     1.139497  0.6794466  0.9087549
##   0.10       7                   100     1.134180  0.6851758  0.8871431
##   0.10       7                   200     1.127551  0.6854471  0.8842233
##   0.10       7                   300     1.126379  0.6852896  0.8849238
##   0.10       7                   400     1.123417  0.6867847  0.8855397
##   0.10       7                   500     1.121551  0.6874811  0.8838705
##   0.10       7                   600     1.121040  0.6873256  0.8840421
##   0.10       7                   700     1.120442  0.6877678  0.8839037
##   0.10       7                   800     1.120144  0.6877076  0.8841339
##   0.10       7                   900     1.120146  0.6876607  0.8842620
##   0.10       7                  1000     1.120144  0.6875703  0.8843156
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000,
##  interaction.depth = 7, shrinkage = 0.01 and n.minobsinnode = 10.
ggplot(gbmModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", gbmModel$modelInfo$label))

# final model & variable importance
gbmModel$finalModel
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## There were 56 predictors of which 55 had non-zero influence.
ggplot(varImp(gbmModel), top = 20) + 
  labs(title = paste0("Variable importance: ", gbmModel$modelInfo$label))

# test set predictions
gbmPred <- predict(gbmModel, newdata = cmp_test)  
gbmPerf <- postResample(pred = gbmPred, obs = yield_test) 

##############################################################################
# fit cubist model; tune committees & neighbors
##############################################################################
# fit & tune model
cubGrid <- expand.grid(committees = c(1:10, 15, 20), 
                      neighbors = c(0, 1, 5, 9))
set.seed(1012)
cubModel <- train(x = cmp_train, 
                  y = yield_train, 
                  preProcess = prep, 
                  trControl = ctrl, 
                  method = "cubist", 
                  tuneGrid = cubGrid)
cubModel
## Cubist 
## 
## 132 samples
##  57 predictor
## 
## Pre-processing: centered (56), scaled (56), remove (1) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 117, 120, 120, 117, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE      Rsquared   MAE      
##    1          0          1.462101  0.5445383  1.0970479
##    1          1          1.544244  0.5505422  1.0726196
##    1          5          1.413605  0.5884390  1.0127159
##    1          9          1.415435  0.5757981  1.0311758
##    2          0          1.309348  0.5861699  1.0390401
##    2          1          1.266222  0.6314397  0.9606537
##    2          5          1.214394  0.6496652  0.9264805
##    2          9          1.231811  0.6392587  0.9517074
##    3          0          1.222276  0.6277949  0.9430797
##    3          1          1.206730  0.6483130  0.9280919
##    3          5          1.155068  0.6696490  0.8670846
##    3          9          1.159427  0.6663908  0.8898611
##    4          0          1.134353  0.6683059  0.9279043
##    4          1          1.161542  0.6783434  0.8898185
##    4          5          1.088536  0.7053267  0.8511029
##    4          9          1.091623  0.7026369  0.8761153
##    5          0          1.054738  0.7031342  0.8602937
##    5          1          1.114396  0.6884811  0.8542105
##    5          5          1.023758  0.7277670  0.7972337
##    5          9          1.028384  0.7263642  0.8180578
##    6          0          1.065000  0.6938469  0.8798195
##    6          1          1.116572  0.6829143  0.8557129
##    6          5          1.022836  0.7240784  0.8036790
##    6          9          1.028574  0.7217627  0.8287710
##    7          0          1.063161  0.6930327  0.8733267
##    7          1          1.124854  0.6811572  0.8660188
##    7          5          1.030245  0.7173850  0.8072162
##    7          9          1.032430  0.7173399  0.8216513
##    8          0          1.080282  0.6862694  0.8898215
##    8          1          1.117292  0.6888184  0.8594842
##    8          5          1.048409  0.7159723  0.8276938
##    8          9          1.043188  0.7145785  0.8317350
##    9          0          1.097547  0.6852917  0.9009156
##    9          1          1.133008  0.6864410  0.8694005
##    9          5          1.059014  0.7137707  0.8285612
##    9          9          1.062247  0.7113388  0.8418928
##   10          0          1.120652  0.6782645  0.9122784
##   10          1          1.142692  0.6850253  0.8768056
##   10          5          1.069575  0.7147424  0.8364076
##   10          9          1.074259  0.7112952  0.8512635
##   15          0          1.173084  0.6689301  0.9091464
##   15          1          1.196077  0.6731132  0.8635982
##   15          5          1.132594  0.7035524  0.8256041
##   15          9          1.134516  0.6975300  0.8545451
##   20          0          1.225224  0.6544912  0.9428941
##   20          1          1.247980  0.6654064  0.8984041
##   20          5          1.182058  0.6916699  0.8516604
##   20          9          1.180900  0.6846730  0.8809144
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 6 and neighbors = 5.
ggplot(cubModel, highlight = TRUE) + 
  labs(title = paste0("Tuning profile: ", cubModel$modelInfo$label))

# final model & variable importance
cubModel$finalModel
## 
## Call:
## cubist.default(x = x, y = y, committees = param$committees)
## 
## Number of samples: 132 
## Number of predictors: 56 
## 
## Number of committees: 6 
## Number of rules per committee: 2, 1, 3, 1, 4, 1
ggplot(varImp(cubModel), top = 20) + 
  labs(title = paste0("Variable importance: ", cubModel$modelInfo$label))

# test set predictions
cubPred <- predict(cubModel, newdata = cmp_test)  
cubPerf <- postResample(pred = cubPred, obs = yield_test) 

(a) Optimal model performance

We evaluate predictive performance for the models based on the root mean squared error (RMSE), R-squared, and mean absolute error (MAE) metrics. We summarize these performance metrics in the table below based on the resampled holdout sets (prefixed as “res_”) as well as the test set (prefixed as “test_”) for all 7 tree-based models.

We note several observations:

  • Generally the RMSE and MAE metrics are weaker on the resampled holdout sets than on the test set. This may relate to the small sample size of the holdout sets, since the training set includes 132 cases and each holdout set in the 10-fold cross-validation includes only 13 cases, whereas the test set includes 44 cases.
  • Overall the comparative performance of the models is mostly consistent between the resampled metrics and the test metrics. In particular, the ensemble methods (BAG, RF, and SGB) outperform the single trees (CART and CIT) on both the resampled holdout sets and the test set. The exception is CUB, which shows signs of over-fitting: it has the best resampled metrics of all the models but falls behind the other ensembles on the test set.
  • Based on the test set metrics, RF is the strongest model overall, while BAG and SGB are both close runners-up.

We also plot below the resampled metrics for the various models.

# resampling & test set metrics
res_perf <- rbind(CART = colMeans(rpartModel$resample[-4]), 
                  CIT = colMeans(ctreeModel$resample[-4]), 
                  M5 = colMeans(m5Model$resample[-4]), 
                  BAG = colMeans(bagModel$resample[-4]), 
                  RF = colMeans(rfModel$resample[-4]), 
                  SGB = colMeans(gbmModel$resample[-4]), 
                  CUB = colMeans(cubModel$resample[-4]))
colnames(res_perf) <- paste0("res_", colnames(res_perf))
test_perf <- rbind(CART = rpartPerf, CIT = ctreePerf, M5 = m5Perf, 
                   BAG = bagPerf, RF = rfPerf, SGB = gbmPerf, CUB = cubPerf) 
colnames(test_perf) <- paste0("test_", colnames(test_perf))
cbind(res_perf, test_perf) %>% 
  kable(digits = 3, 
        caption = "Resampled & test set performance metrics")
Resampled & test set performance metrics
res_RMSE res_Rsquared res_MAE test_RMSE test_Rsquared test_MAE
CART 1.559 0.422 1.212 1.310 0.429 1.072
CIT 1.516 0.458 1.231 1.364 0.359 1.116
M5 1.284 0.576 1.074 1.156 0.521 0.964
BAG 1.274 0.605 0.985 1.006 0.626 0.815
RF 1.162 0.684 0.917 0.972 0.661 0.795
SGB 1.081 0.712 0.854 1.020 0.632 0.849
CUB 1.023 0.724 0.804 1.495 0.416 0.880
# plot resampled metrics
list(CART = rpartModel, CIT = ctreeModel, M5 = m5Model, 
     BAG = bagModel, RF = rfModel, SGB = gbmModel, CUB = cubModel) %>% 
  resamples() %>% 
  bwplot(metric = c("RMSE", "Rsquared", "MAE"), 
         layout = c(3,1), 
         between = list(x = 1), 
         main = "Comparison of resampled performance metrics")

(b) Most important predictors

From the variable importance plot above, we see that the top 10 important predictors in the RF model include 6 manufacturing process and 4 biological variables:

  • M32
  • M13
  • B12
  • B3
  • M17
  • B6
  • M39
  • M9
  • M36
  • B11

The top 10 predictors are summarized in the table below for each of the 7 tree-based models as well as the SVM and elastic net (ENET) models, which were the optimal non-linear and linear models from the previous exercises. It is interesting to see that there is greater diversity of opinion among the tree-based models, compared to the output seen previously for the non-linear and linear models.

Reviewing variable importance for the RF model, we observe that:

  • Among the top 5 and top 10 variables, the manufacturing process and biological variables are relatively evenly represented. For instance, 2 of the top 5 and 4 of the top 10 are biological variables.
  • Overall, taking into account the relative order and magnitude of the importance scores, it appears that the manufacturing variables are more important on balance than the biological variables. The top 2 are both manufacturing variables, with importance scores that are much higher than those for the remaining variables.
  • While not identical, the top 10 predictors from the RF model are relatively consistent with those from the SVM and ENET models. For instance, 4 of the top 5 predictors and 8 of the top 10 predictors are the same when comparing the RF model to the SVM and ENET models. However, outside of the top 2 predictors (M32 and M13), the relative order of the remaining top 10 predictors varies between the models.
top10 <- function(mod) {
  varImp(mod)$importance %>%
  mutate(Var = paste(rownames(varImp(mod)$importance), 
                     round(Overall, 1), sep = " - ")) %>%
  arrange(desc(Overall)) %>% 
  dplyr::select(Var) %>% 
  head(10)
}
prevtop10 <- c("M32", "M13", "M17", "B6", "B3", "M9", "M36", "M6", "B12", "B2")
cbind(CART = top10(rpartModel), 
      CIT = top10(ctreeModel), 
      M5 = top10(m5Model), 
      BAG = top10(bagModel), 
      RF = top10(rfModel), 
      SGB = top10(gbmModel), 
      CUB = top10(cubModel), 
      SVM = prevtop10, 
      ENET = prevtop10) %>% 
  kable(col.names = c("CART", "CIT", "M5", "BAG", "RF", "SGB", "CUB", "SVM", "ENET"), 
        caption = "Top 10 important variables")
Top 10 important variables
CART CIT M5 BAG RF SGB CUB SVM ENET
M17 - 100 M32 - 100 M32 - 100 M17 - 100 M32 - 100 M32 - 100 M32 - 100 M32 M32
M11 - 62.4 M13 - 89.9 M13 - 89.9 B12 - 94.7 M13 - 61.5 B12 - 35.1 M13 - 80 M13 M13
B12 - 57.9 M17 - 82.6 M17 - 82.6 M9 - 91.3 B12 - 58.2 M13 - 31 M17 - 61.1 M17 M17
M13 - 49.6 B6 - 70.9 B6 - 70.9 M32 - 89.4 B3 - 52 M6 - 24.5 M29 - 40 B6 B6
M25 - 49.4 B3 - 63.9 B3 - 63.9 B6 - 76.9 M17 - 49.8 M9 - 20.8 M18 - 37.8 B3 B3
M30 - 39.8 M9 - 62.9 M9 - 62.9 B3 - 74.1 B6 - 47.9 B3 - 16 M4 - 35.6 M9 M9
M18 - 37.7 M36 - 61.4 M36 - 61.4 M6 - 71.4 M39 - 41.9 M17 - 15.4 M25 - 35.6 M36 M36
M32 - 37.2 M6 - 60.2 M6 - 60.2 M31 - 62.5 M9 - 41.3 M31 - 14.8 B8 - 27.8 M6 M6
B11 - 27.4 B12 - 59.6 B12 - 59.6 M13 - 62.3 M36 - 39.9 B9 - 13.5 M20 - 22.2 B12 B12
B3 - 25.3 B2 - 51.2 B2 - 51.2 B11 - 50 B11 - 39.8 B6 - 8.7 M10 - 17.8 B2 B2
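
To quantify the overlap described above, a small sketch comparing the RF top 10 with the SVM/ENET top 10 (prevtop10) recorded from the earlier homeworks:

rf_imp <- varImp(rfModel)$importance
rf_top10 <- rownames(rf_imp)[order(rf_imp$Overall, decreasing = TRUE)][1:10]
intersect(rf_top10, prevtop10)   # 8 predictors in common, per the table above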

(c) Optimal single tree

The optimal single-tree model based on the test set metrics above is the CART model (we treat M5 as a model tree rather than a simple regression tree, since its terminal nodes contain linear models). The tree diagram is shown below. Based on the tree splits and the distribution of yield in the terminal nodes, this model indicates the following predictor relationships with yield:

  • M32, B12, M6: positive relationship (higher values correspond to greater yield)
  • M17: negative relationship (lower values correspond to greater yield).
plot(as.party(rpartModel$finalModel), 
     main = paste0("Tree diagram: ", rpartModel$modelInfo$label))