DATA 624 HW5

library(tidyverse)
library(mlbench)
library(caret)
library(earth)
library(kernlab)
library(kableExtra)
library(AppliedPredictiveModeling)
library(corrplot)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(gridExtra)
library(doParallel)
library(rpart.plot)
library(rattle)

7.2

(a)

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the \(x\) values are random variables uniformly distributed between \([0,1]\) (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

# number of cores for parallel processing
num_cores <- 4
cl <- makeCluster(num_cores)
registerDoParallel(cl)

set.seed(200)

trainingData <- mlbench.friedman1(200, sd=1)
## we convert the 'x' data from a matrix to a dataframe
## one reason is that this will give the columns names 
trainingData$x <- data.frame(trainingData$x)
## look at the data using 
featurePlot(trainingData$x, trainingData$y)

## or other methods

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to 
## estimate the true error rate with good precision
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center","scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set 
## performance values 
knn.results <- postResample(pred = knnPred, obs = testData$y)
knn.results
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Which models appear to give the best performance? Does MARS select the informative predictors (those named x1-x5)?

NEURAL NETWORK MODEL

First, we check for predictors to remove so that the maximum absolute pairwise correlation between the predictors is less than 0.75,

findCorrelation(cor(trainingData$x), cutoff = .75)
## integer(0)

No pair of predictors has an absolute correlation of 0.75 or greater, so nothing needs to be removed. Next, we create a specific candidate set of models to evaluate:

# specify and store the resampling method
ctrl <- trainControl(method = "cv", 
                     allowParallel = TRUE,
                     number = 10)

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10),
                        .bag = FALSE)

set.seed(100) 
nnetTune <- train(trainingData$x, trainingData$y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.391555  0.7623243  1.887489
##   0.00    2    2.422221  0.7544913  1.917770
##   0.00    3    2.043540  0.8224555  1.630840
##   0.00    4    2.367168  0.7999596  1.801993
##   0.00    5    2.371240  0.7841913  1.785091
##   0.00    6    2.779869  0.7443282  2.008843
##   0.00    7    3.764096  0.6207519  2.619394
##   0.00    8    6.287055  0.4726517  3.479320
##   0.00    9    4.448435  0.5936014  2.869717
##   0.00   10    3.290886  0.6356008  2.396166
##   0.01    1    2.385379  0.7602941  1.887813
##   0.01    2    2.417335  0.7524249  1.930550
##   0.01    3    2.151211  0.8016017  1.701947
##   0.01    4    2.091926  0.8154380  1.676654
##   0.01    5    2.174252  0.8008679  1.739792
##   0.01    6    2.231823  0.8093149  1.796266
##   0.01    7    2.387448  0.7723920  1.913901
##   0.01    8    2.418258  0.7789421  1.899541
##   0.01    9    2.403486  0.7719540  1.969117
##   0.01   10    2.510406  0.7320214  2.028680
##   0.10    1    2.393969  0.7596425  1.894199
##   0.10    2    2.423111  0.7535683  1.922239
##   0.10    3    2.169933  0.7982368  1.726892
##   0.10    4    2.059078  0.8224162  1.648608
##   0.10    5    1.980727  0.8378612  1.586695
##   0.10    6    2.154588  0.8087362  1.693934
##   0.10    7    2.154261  0.8175103  1.668936
##   0.10    8    2.283792  0.7916330  1.807037
##   0.10    9    2.337881  0.7769658  1.805339
##   0.10   10    2.340156  0.7672626  1.866889
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
nnet.results <- postResample(pred = nnetPred, obs = testData$y)
nnet.results
##      RMSE  Rsquared       MAE 
## 2.1328254 0.8228765 1.5911867

MARS MODEL (Multivariate Adaptive Regression Splines)

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

marsTuned <- train(trainingData$x, trainingData$y,
                   method = "earth",
                   # Explicitly declare the candidate models to test,
                   tuneGrid = marsGrid,
                   trControl = ctrl)

marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.327937  0.2544880  3.600474
##   1        3      3.572450  0.4912720  2.895811
##   1        4      2.596841  0.7183600  2.106341
##   1        5      2.370161  0.7659777  1.918669
##   1        6      2.276141  0.7881481  1.810001
##   1        7      1.766728  0.8751831  1.390215
##   1        8      1.780946  0.8723243  1.401345
##   1        9      1.665091  0.8819775  1.325515
##   1       10      1.663804  0.8821283  1.327657
##   1       11      1.657738  0.8822967  1.331730
##   1       12      1.653784  0.8827903  1.331504
##   1       13      1.648496  0.8823663  1.316407
##   1       14      1.639073  0.8841742  1.312833
##   1       15      1.639073  0.8841742  1.312833
##   1       16      1.639073  0.8841742  1.312833
##   1       17      1.639073  0.8841742  1.312833
##   1       18      1.639073  0.8841742  1.312833
##   1       19      1.639073  0.8841742  1.312833
##   1       20      1.639073  0.8841742  1.312833
##   1       21      1.639073  0.8841742  1.312833
##   1       22      1.639073  0.8841742  1.312833
##   1       23      1.639073  0.8841742  1.312833
##   1       24      1.639073  0.8841742  1.312833
##   1       25      1.639073  0.8841742  1.312833
##   1       26      1.639073  0.8841742  1.312833
##   1       27      1.639073  0.8841742  1.312833
##   1       28      1.639073  0.8841742  1.312833
##   1       29      1.639073  0.8841742  1.312833
##   1       30      1.639073  0.8841742  1.312833
##   1       31      1.639073  0.8841742  1.312833
##   1       32      1.639073  0.8841742  1.312833
##   1       33      1.639073  0.8841742  1.312833
##   1       34      1.639073  0.8841742  1.312833
##   1       35      1.639073  0.8841742  1.312833
##   1       36      1.639073  0.8841742  1.312833
##   1       37      1.639073  0.8841742  1.312833
##   1       38      1.639073  0.8841742  1.312833
##   2        2      4.327937  0.2544880  3.600474
##   2        3      3.572450  0.4912720  2.895811
##   2        4      2.661826  0.7070510  2.173471
##   2        5      2.404015  0.7578971  1.975387
##   2        6      2.243927  0.7914805  1.783072
##   2        7      1.856336  0.8605482  1.435682
##   2        8      1.754607  0.8763186  1.396841
##   2        9      1.603578  0.8938666  1.261361
##   2       10      1.492421  0.9084998  1.168700
##   2       11      1.317350  0.9292504  1.033926
##   2       12      1.304327  0.9320133  1.019108
##   2       13      1.277510  0.9323681  1.002927
##   2       14      1.269626  0.9350024  1.003346
##   2       15      1.266217  0.9359400  1.013893
##   2       16      1.268470  0.9354868  1.011414
##   2       17      1.268470  0.9354868  1.011414
##   2       18      1.268470  0.9354868  1.011414
##   2       19      1.268470  0.9354868  1.011414
##   2       20      1.268470  0.9354868  1.011414
##   2       21      1.268470  0.9354868  1.011414
##   2       22      1.268470  0.9354868  1.011414
##   2       23      1.268470  0.9354868  1.011414
##   2       24      1.268470  0.9354868  1.011414
##   2       25      1.268470  0.9354868  1.011414
##   2       26      1.268470  0.9354868  1.011414
##   2       27      1.268470  0.9354868  1.011414
##   2       28      1.268470  0.9354868  1.011414
##   2       29      1.268470  0.9354868  1.011414
##   2       30      1.268470  0.9354868  1.011414
##   2       31      1.268470  0.9354868  1.011414
##   2       32      1.268470  0.9354868  1.011414
##   2       33      1.268470  0.9354868  1.011414
##   2       34      1.268470  0.9354868  1.011414
##   2       35      1.268470  0.9354868  1.011414
##   2       36      1.268470  0.9354868  1.011414
##   2       37      1.268470  0.9354868  1.011414
##   2       38      1.268470  0.9354868  1.011414
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.

Viewing the most important variables,

varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.73
## X5   15.52
## X3    0.00
marsPred <- predict(marsTuned, newdata = testData$x)
mars.results <- postResample(pred = marsPred, obs = testData$y)
mars.results
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

SVM (Support Vector Machines)

svmRTuned <- train(trainingData$x, trainingData$y,
                  method = "svmRadial",
                  preProcess = c("center","scale"),
                  tuneLength = 14,
                  trControl = ctrl)

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.490737  0.8009120  1.982118
##      0.50  2.246868  0.8153042  1.774454
##      1.00  2.051872  0.8400992  1.614368
##      2.00  1.949707  0.8534618  1.524201
##      4.00  1.886125  0.8610205  1.465373
##      8.00  1.849240  0.8654699  1.436630
##     16.00  1.834611  0.8673617  1.429808
##     32.00  1.833165  0.8675812  1.428637
##     64.00  1.833165  0.8675812  1.428637
##    128.00  1.833165  0.8675812  1.428637
##    256.00  1.833165  0.8675812  1.428637
##    512.00  1.833165  0.8675812  1.428637
##   1024.00  1.833165  0.8675812  1.428637
##   2048.00  1.833165  0.8675812  1.428637
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06315483
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06315483 and C = 32.
svmRPred <- predict(svmRTuned, newdata = testData$x)
svm.results <- postResample(pred = svmRPred, obs = testData$y)
svm.results
##      RMSE  Rsquared       MAE 
## 2.0741633 0.8255819 1.5755127

Now we can compare the final results,

rbind("nnet" = nnet.results,
      "mars" = mars.results,
      "svm" = svm.results,
      "knn" = knn.results) |>
  kable()
       RMSE       Rsquared   MAE
nnet   2.132825   0.8228765  1.591187
mars   1.158995   0.9460418  0.925023
svm    2.074163   0.8255819  1.575513
knn    3.204060   0.6819919  2.568346

The MARS model performs best according to the RMSE.

varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.73
## X5   15.52
## X3    0.00

The MARS model selects only the informative predictors: X1-X5 appear in the importance ranking, none of the uninformative predictors X6-X10 are used, and X3 contributes the least (importance 0).
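
As a quick cross-check (a minimal sketch; output not shown), caret's predictors() and earth's evimp() list exactly which terms enter the final MARS model; only predictors from X1-X5 should appear.

# predictors that enter the final MARS model (output not shown)
predictors(marsTuned)
# earth's own importance estimate for the final model
earth::evimp(marsTuned$finalModel)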

7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Pre-processing

data("ChemicalManufacturingProcess")
cmp <- ChemicalManufacturingProcess

set.seed(100)
trans <- preProcess(cmp, 
                   method = c("BoxCox", "knnImpute", "center", "scale"))

# apply the transformed data
df <- predict(trans, cmp)
df$Yield <- cmp$Yield

# split the data into training and testing sets
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
train.set <- df[trainRows, ]
test.set <- df[-trainRows, ]

# set aside predictors and response 
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield
test.x <- test.set |> select(-Yield)
test.y <- test.set$Yield

Training Models

KNN MODEL

set.seed(100)
knnTuned <- train(train.x, train.y,
                  method = "knn",
                  preProcess = c("center","scale"),
                  tuneLength = 10,
                  trControl = ctrl)

knnPred <- predict(knnTuned, test.x)
knn.results <- postResample(knnPred, test.y)

NEURAL NETWORK MODEL

tooHigh <- findCorrelation(cor(train.x), cutoff = .75)
trainXnnet <- train.x[, -tooHigh]
testXnnet <- test.x[, -tooHigh]

set.seed(100)
nnetGrid <- expand.grid(size = c(1:10),
                        decay = c(0, 0.01, 0.1),
                        bag = FALSE)

nnetTuned <- train(trainXnnet, train.y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
                  maxit = 500)

nnetPred <- predict(nnetTuned, testXnnet)
nnet.results <- postResample(nnetPred, test.y)

MARS MODEL

mars.grid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)
marsTuned <- train(train.x, train.y,
                   method = "earth",
                   preProc = c("center","scale"),
                   tuneGrid = mars.grid,
                   trControl = ctrl)

marsPred <- predict(marsTuned, test.x)
mars.results <- postResample(marsPred, test.y)

SVM MODEL

set.seed(100)
svmRTuned <- train(train.x, train.y,
                   method = "svmRadial",
                   preProc = c("center","scale"),
                   tuneLength = 14,
                   trControl = ctrl)

svmRPred <- predict(svmRTuned, test.x)
svm.results <- postResample(svmRPred, test.y)

(a)

Which nonlinear regression model gives the optimal resampling and test set performance?

rbind("nnet" = postResample(predict(nnetTuned), train.y),
      "mars" = postResample(predict(marsTuned), train.y),
      "svm" = postResample(predict(svmRTuned), train.y),
      "knn" = postResample(predict(knnTuned), train.y)) |>
  kable(caption = "Resampling Performance")
Resampling Performance

       RMSE       Rsquared   MAE
nnet   0.9981769  0.7326587  0.7407692
mars   1.0691083  0.6765888  0.8652505
svm    0.1754827  0.9931026  0.1686926
knn    1.1777361  0.6453121  0.9337500

The SVM model performs best in this comparison; note, however, that these values come from predictions on the training data (apparent performance), so the near-perfect SVM fit should be read with caution.
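
As an alternative check (a minimal sketch; output not shown), caret's resamples() collects the cross-validated results of the tuned models directly; because each model was trained with the same ctrl object after set.seed(100), the fold assignments should line up.

# compare the cross-validated resampling distributions directly (output not shown)
resamp <- resamples(list(nnet = nnetTuned, mars = marsTuned,
                         svm = svmRTuned, knn = knnTuned))
summary(resamp)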

rbind("nnet" = nnet.results,
      "mars" = mars.results,
      "svm" = svm.results,
      "knn" = knn.results) |>
  kable(caption = "Test Performance")
Test Performance

       RMSE      Rsquared   MAE
nnet   1.247386  0.4492970  0.9955054
mars   1.092737  0.5656352  0.8424734
svm    1.066615  0.5902318  0.8174537
knn    1.127135  0.5660707  0.9067857

The SVM model also has the best performance on the test set.

(b)

Which predictors are the most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

# variable importance
var.imp <- varImp(svmRTuned)
plot(var.imp, top = 10)

Among the 10 most important variables, 6 are process, and 4 are biological. For the linear model, 7 were process, and 3 were biological.
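
A small sketch (not part of the output above; imp and top10 are illustrative helper names) to tally the two predictor types among the top 10, relying on the fact that the biological predictors' names begin with "Biological":

# tally biological vs. process predictors among the top 10 (sketch)
imp <- var.imp$importance
top10 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:10]
table(ifelse(grepl("^Biological", top10), "Biological", "Process"))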

(c)

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# store 10 most important variable names
m.imp <- var.imp$importance |>
  data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  row.names()

# correlation matrix of 10 most important variables and response variable
df |>
  select(all_of(c("Yield",m.imp))) |>
  cor() |>
  round(2) |>
  corrplot::corrplot(method = "square",
            order = "alphabet",
            tl.cex = 0.6,
            type = "lower")

featurePlot(df[,m.imp], df$Yield,
            between = list(x = 1, y = 1),
            type = c("p","smooth"))

These plots do provide some intuition about the relationships with yield: the positive and negative trends between each predictor and Yield in the feature plot correspond to the signs and strengths of the correlations shown in the matrix plot.

8.1

Recreate the simulated data from Exercise 7.2:

library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"

(a)

Fit a random forest model to all the predictors, then estimate the variable importance scores:

model1 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp1 <- varImp(model1, scale=FALSE)
rfImp1 |>
  as.data.frame() |>
  arrange(desc(Overall)) |>
  kable()
Overall
V1 8.8389088
V4 7.5882255
V2 6.4902306
V5 2.2742601
V3 0.6758316
V6 0.1743678
V7 0.1513658
V9 -0.0298983
V8 -0.0307894
V10 -0.0852922

Did the random forest model significantly use the uninformative predictors (V6 - V10)?

No; the uninformative predictors V6-V10 were not of significant use to the model, as their importance scores are near zero or slightly negative.

(b)

Now add an additional predictor that is highly correlated with one of the informative predictors. For example:

simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

model2 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
  as.data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  kable()
Overall
V4 6.9392443
V1 6.2978074
V2 6.0803813
duplicate1 3.5641158
V5 2.0310409
V3 0.5841072
V6 0.0794764
V10 -0.0071509
V7 -0.0256641
V9 -0.0883946

The importance of the V1 variable is diminished as the duplicate1 variable takes on some of its importance in the model.

Adding another highly correlated variable,

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .08
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9551123
model3 <- randomForest(y ~ ., data = simulated,
                       importance = TRUE,
                       ntree = 1000)

rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
  as.data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  kable()
Overall
V4 7.3149466
V2 6.8801700
V1 5.2208505
duplicate2 2.9393051
duplicate1 2.4484915
V5 2.3448400
V3 0.4148658
V6 0.0847761
V10 0.0687704
V7 0.0477210

As another highly correlated variable is added, the importance of the V1 variable to the model is reduced further.

(c)

Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?

model4 <- cforest(y ~ ., data = simulated)
# conditional = FALSE
varImp(model4, conditional = FALSE) |> 
  as.data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  kable()
Overall
V4 7.4074927
V2 5.9066877
V1 5.6833305
duplicate2 2.3032633
duplicate1 1.8237430
V5 1.7047104
V7 0.0143123
V6 0.0110430
V9 0.0037485
V3 -0.0056921
# conditional = TRUE
varImp(model4, conditional = TRUE) |> 
  as.data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  kable()
Overall
V4 5.8087533
V2 4.6372690
V1 2.3460991
V5 1.2106164
duplicate2 0.9041890
duplicate1 0.5438922
V6 0.0169200
V7 0.0133140
V10 0.0088332
V8 -0.0022507

Setting conditional = FALSE yields an importance pattern closer to that of the traditional random forest model, though it is still somewhat different. The results from setting conditional = TRUE are very different from the traditional random forest: V1 and the two duplicate predictors receive much lower importance relative to V4 and V2.

(d)

Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?

Boosted Trees

set.seed(123)
gbmModel <- gbm(y ~ ., data = simulated, distribution = "gaussian")
summary(gbmModel)

##                   var   rel.inf
## V4                 V4 29.689768
## V2                 V2 23.365065
## V1                 V1 18.025324
## V5                 V5 11.580817
## duplicate1 duplicate1  7.670274
## V3                 V3  7.434302
## duplicate2 duplicate2  1.803061
## V6                 V6  0.431389
## V7                 V7  0.000000
## V8                 V8  0.000000
## V9                 V9  0.000000
## V10               V10  0.000000

The pattern from the boosted trees model differs from that of the traditional random forest but is closer to the modified (conditional = TRUE) conditional inference importances: V4, V2, and V1 rank highest, while the duplicate predictors fall further down the list.

Cubist

set.seed(123)
simulated.x <- simulated |> select(-y)
cubistModel <- cubist(simulated.x, simulated$y, committees = 100)
varImp(cubistModel) |>
  as.data.frame() |>
  arrange(desc(Overall)) |>
  head(10) |>
  kable()
Overall
V1 62.5
V2 62.5
V4 46.5
V3 43.0
V5 29.5
duplicate1 10.5
V8 5.0
duplicate2 4.0
V6 3.5
V7 0.0

Again, the pattern differs. In the Cubist model, V1 (tied with V2) is the most important variable and V4 drops to third, while V3 ranks much higher than it did in the random forest. The duplicate predictors receive relatively little importance.

8.2

Use simulation to show tree bias with different granularities.

We'll demonstrate by creating three predictors of different granularities, i.e., different numbers of distinct values (and hence different variances).

set.seed(100)
# create predictors
large <- sample(0:1000, 100, replace = TRUE)
medium <- sample(0:100, 100, replace = TRUE)
small <- sample(0:10, 100, replace = TRUE)

# create y value including error term
y <- large + medium + small + rnorm(100)

# create data frame
df <- data.frame(y, large, medium, small)

# fit single regression tree
rpartTree <- rpart(y ~ ., data = df)
varImp(rpartTree) |>
  as.data.frame() |>
  kable()
Overall
large 3.0999465
medium 0.5251032
small 0.0912745

We can see the model favors the predictors with more distinct values (larger variance), consistent with the selection bias of trees toward predictors with finer granularity.
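
To separate the granularity effect from each predictor's genuine contribution to y, a variant of the simulation (a sketch with illustrative predictors fine and coarse, not part of the exercise code) fits a tree to a pure-noise response; the finer-grained predictor still tends to receive the larger importance score:

set.seed(101)
n <- 200
fine   <- runif(n)                         # many distinct values (fine granularity)
coarse <- sample(1:4, n, replace = TRUE)   # few distinct values (coarse granularity)
y.noise <- rnorm(n)                        # response unrelated to either predictor
noiseTree <- rpart(y.noise ~ fine + coarse,
                   data = data.frame(y.noise, fine, coarse),
                   control = rpart.control(cp = 0))  # grow fully so splits occur
varImp(noiseTree)  # 'fine' typically dominates despite carrying no signal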

8.3

In stochastic gradient boosting, the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

(a)

Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

The bagging fraction is the fraction of the training data sampled to fit each tree. With a bagging fraction of 0.1, only 10% of the data is selected at each iteration, so successive trees are built on very different subsets and many different predictors end up looking important. With a bagging fraction of 0.9, each tree sees nearly the same data, so the trees tend to agree on the same few important predictors.

Boosting can be susceptible to over-fitting because it selects an optimal weak learner at each iteration. To offset this, regularization (shrinkage) is applied through the learning rate, which is the fraction of each new tree's predictions added to the previous iteration's predictions. As the learning rate increases, a larger fraction of each prediction is added to the model, making the procedure greedier: a learning rate near 1 means essentially no shrinkage, so the first few trees dominate the fit and importance concentrates on fewer variables.
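
In symbols (a brief sketch of the update just described, using standard boosting notation rather than anything taken from the text), each stage adds only a shrunken fraction of the new tree's fit:

\(\hat{f}_k(x) = \hat{f}_{k-1}(x) + \lambda \, T_k(x)\)

where \(\lambda\) is the learning rate and \(T_k\) is the tree fit to the current residuals on the subsample defined by the bagging fraction. With \(\lambda = 0.9\) the early trees account for most of the model, so their split variables dominate the importance ranking; with \(\lambda = 0.1\) many trees contribute small pieces, spreading importance across more predictors.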

(b)

Which model do you think would be more predictive of other samples?

A lower learning rate indicates slower learning and smaller incremental errors, while a lower bagging fraction means a smaller proportion of the training data is used to fit each iteration, which limits each iteration's exposure to the full range of the data. However, the lower learning rate should generalize better to unseen samples, so I believe the model on the left would be more predictive.

(c)

How would increasing interaction depth affect the slope of predictor importance for either model in Fig 8.24?

The interaction depth is the maximum number of splits from the root node to the farthest leaf node. A larger depth gives the model the ability to capture more complex interactions between predictors. Let's demonstrate using the solubility data from the figure.

data(solubility)
set.seed(100)

ctrl <- trainControl(method = "cv", 
                     allowParallel = TRUE,
                     number = 10)

# build left grids/models at default interaction depth of 1 and 10
leftGrid1 <- expand.grid(n.trees = 100,
                        interaction.depth = 1,
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

leftModel1 <- train(x = solTrainXtrans, y = solTrainY,
                   method = "gbm",
                   tuneGrid = leftGrid1,
                   trControl = ctrl,
                   verbose = FALSE)

leftGrid10 <- expand.grid(n.trees = 100,
                        interaction.depth = 10,
                        shrinkage = 0.1,
                        n.minobsinnode = 10)

leftModel10 <- train(x = solTrainXtrans, y = solTrainY,
                   method = "gbm",
                   tuneGrid = leftGrid10,
                   trControl = ctrl,
                   verbose = FALSE)

# build right grids / models
rightGrid1 <- expand.grid(n.trees = 100,
                        interaction.depth = 1,
                        shrinkage = 0.9,
                        n.minobsinnode = 10)
rightModel1 <- train(x = solTrainXtrans, y = solTrainY,
                   method = "gbm",
                   tuneGrid = rightGrid1,
                   trControl = ctrl,
                   verbose = FALSE)

rightGrid10 <- expand.grid(n.trees = 100,
                        interaction.depth = 10,
                        shrinkage = 0.9,
                        n.minobsinnode = 10)

rightModel10 <- train(x = solTrainXtrans, y = solTrainY,
                   method = "gbm",
                   tuneGrid = rightGrid10,
                   trControl = ctrl,
                   verbose = FALSE)

grid.arrange(plot(varImp(leftModel1), top = 20, main = "Left Model: Interaction Depth: 1"), 
             plot(varImp(leftModel10), top = 20, main = "Left Model: Interaction Depth: 10"),
             plot(varImp(rightModel1), top = 20, main = "Right Model: Interaction Depth: 1"),
             plot(varImp(rightModel10), top = 20, main = "Right Model: Interaction Depth: 10"),
             nrow = 2)

Unfortunately I wasn't able to recreate the bagging fraction within the caret tuning grid, but we can observe that for the left model, which spreads importance across more variables, the slope becomes steeper as the interaction depth increases. Conversely, for the right model, which concentrates importance on fewer variables, the slope decreases as the interaction depth increases.
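
If we did want to vary the bagging fraction as well, one option (a sketch, output not shown; solTrain, gbmLeft, and gbmRight are illustrative names) is to call gbm() directly, since its bag.fraction argument is not part of caret's gbm tuning grid:

# fit the two extreme models directly with gbm so bag.fraction can be set
set.seed(100)
solTrain <- data.frame(y = solTrainY, solTrainXtrans)
gbmLeft  <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.1, bag.fraction = 0.1)
gbmRight <- gbm(y ~ ., data = solTrain, distribution = "gaussian",
                n.trees = 100, interaction.depth = 1,
                shrinkage = 0.9, bag.fraction = 0.9)
summary(gbmLeft, plotit = FALSE)   # relative influence, left-hand settings
summary(gbmRight, plotit = FALSE)  # relative influence, right-hand settings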

8.7

Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.

Pre-processing

data("ChemicalManufacturingProcess")
cmp <- ChemicalManufacturingProcess

set.seed(100)
trans <- preProcess(cmp, 
                   method = c("BoxCox", "knnImpute", "center", "scale"))

# apply the transformed data
df <- predict(trans, cmp)
df$Yield <- cmp$Yield

# split the data into training and testing sets
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
train.set <- df[trainRows, ]
test.set <- df[-trainRows, ]

# set aside predictors and response 
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield
test.x <- test.set |> select(-Yield)
test.y <- test.set$Yield

Single Tree Regression

set.seed(100)

registerDoSEQ()

rpartModel <- train(train.x, train.y,
                    method = "rpart",
                    tuneLength = 10)

rpartPred <- predict(rpartModel, test.x)
rpart.results <- postResample(rpartPred, test.y)

Bagged Trees

set.seed(100)

baggedModel <- train(train.x, train.y,
                    method = "treebag",
                    tuneLength = 10)

baggedPred <- predict(baggedModel, test.x)
bagged.results <- postResample(baggedPred, test.y)

Random Forest

set.seed(100)

cl <- makeCluster(4)
registerDoParallel(cl)

rfModel <- train(train.x, train.y,
                 method = "rf",
                 tuneLength = 10,
                 trControl = ctrl)

rfPred <- predict(rfModel, test.x)
rf.results <- postResample(rfPred, test.y)

Gradient Boosting

We tune the gradient boosting model over interaction depth and shrinkage, leaving the minimum terminal node size at its default of 10.

set.seed(100)

grid <- expand.grid(n.trees=100,
                    interaction.depth=c(1, 10),
                    shrinkage=c(0.1, 0.5),
                    n.minobsinnode=10)

gbmModel <- train(train.x, train.y,
                  method = "gbm",
                  tuneGrid = grid,
                  trControl = ctrl,
                  verbose = FALSE)

gbmPred <- predict(gbmModel, test.x)
gbm.results <- postResample(gbmPred, test.y)

Cubist

set.seed(100)

cubistModel <- train(train.x, train.y,
                     method = "cubist",
                     tuneLength = 10,
                     trControl = ctrl)

cubistPred <- predict(cubistModel, test.x)
cubist.results <- postResample(cubistPred, test.y)

(a)

Which tree-based regression model gives the optimal resampling and test set performance?

set.seed(100)
rbind("single regression" = postResample(predict(rpartModel), train.y),
      "bagged trees" = postResample(predict(baggedModel), train.y),
      "random forest" = postResample(predict(rfModel), train.y),
      "gradient boosted" = postResample(predict(gbmModel), train.y),
      "cubist" = postResample(predict(cubistModel), train.y)) |>
  kable(caption = "Resampling Performance")
Resampling Performance

                    RMSE       Rsquared   MAE
single regression   0.9498059  0.7447407  0.7709968
bagged trees        0.7888974  0.8445366  0.5899390
random forest       0.4507221  0.9655947  0.3352091
gradient boosted    0.3553975  0.9692540  0.2348292
cubist              0.1750849  0.9933089  0.1336386
rbind("single regression" = rpart.results,
      "bagged trees" = bagged.results,
      "random forest" = rf.results,
      "gradient boosted" = gbm.results,
      "cubist" = cubist.results) |>
  kable(caption = "Testing Performance")
Testing Performance

                    RMSE       Rsquared   MAE
single regression   1.2864914  0.4092729  1.0280374
bagged trees        1.0542203  0.6015022  0.8411462
random forest       0.9898729  0.6701921  0.8349617
gradient boosted    0.8807304  0.7170186  0.7322906
cubist              0.7648249  0.7928630  0.6056563

The Cubist model provides the best performance in both comparisons, keeping in mind that the first table reflects predictions on the training data rather than true resampled estimates.
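
As a cross-check (a minimal sketch; output not shown), caret's getTrainPerf() extracts each model's resampled performance for its best tuning parameters, which avoids re-predicting the training data:

# resampled performance of each final tune (output not shown)
rbind("single regression" = getTrainPerf(rpartModel),
      "bagged trees"      = getTrainPerf(baggedModel),
      "random forest"     = getTrainPerf(rfModel),
      "gradient boosted"  = getTrainPerf(gbmModel),
      "cubist"            = getTrainPerf(cubistModel))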

(b)

Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

plot(varImp(cubistModel), top = 10, main = "Cubist Model")

In this model, process variables make up 6 of the top 10 most important predictors, the same split we observed for the optimal nonlinear (SVM) model. In the linear model, process variables made up 7 of the top 10.
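
For a direct comparison, a small sketch (output not shown; cubist.imp is an illustrative name) lists the top 10 predictor names from the Cubist model so they can be set beside the linear and nonlinear rankings:

# names of the top 10 predictors in the Cubist model (output not shown)
cubist.imp <- varImp(cubistModel)$importance
rownames(cubist.imp)[order(cubist.imp$Overall, decreasing = TRUE)][1:10]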

(c)

Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

srt <- rpart(Yield ~ ., data=df)
fancyRpartPlot(srt, palette="GnBu", caption = "Single Regression Tree")

We can observe the following:

  • The root node for this tree is ManufacturingProcess32; the split occurs where the variable is < 0.22.
  • Each child node represents the subset of the data that meets the denoted split criterion.
  • The tree depth is the number of split levels in the tree; a deeper tree with more splits indicates more complex decision rules and potentially more overfitting to the data. In this tree, there are three splits.
  • The leaf nodes at the bottom are the terminal nodes that contain the final predicted value of the response for each branch of the tree. We can see that the predicted yield is generally higher when ManufacturingProcess32 is greater than 0.22, and we can observe where the exceptions to this occur.