DATA 624 HW5
library(tidyverse)
library(mlbench)
library(caret)
library(earth)
library(kernlab)
library(kableExtra)
library(AppliedPredictiveModeling)
library(corrplot)
library(randomForest)
library(party)
library(gbm)
library(Cubist)
library(rpart)
library(gridExtra)
library(doParallel)
library(rpart.plot)
library(rattle)
7.2
(a)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)
where the \(x\) values are random variables uniformly distributed between \([0, 1]\) (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
# number of cores for parallel processing
num_cores <- 4
cl <- makeCluster(num_cores)
registerDoParallel(cl)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
## we convert the 'x' data from a matrix to a dataframe
## one reason is that this will give the columns names
trainingData$x <- data.frame(trainingData$x)
## look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
Tune several models on these data. For example:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center","scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
knn.results <- postResample(pred = knnPred, obs = testData$y)
knn.results
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named x1-x5)?
NEURAL NETWORK MODEL
First we remove predictors, as needed, to ensure that the maximum absolute pairwise correlation between the predictors is less than 0.75.
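The code for this check is not shown above; a minimal sketch using caret's findCorrelation with the 0.75 cutoff described in the text:
# flag predictors involved in absolute pairwise correlations of 0.75 or greater
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
tooHigh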
## integer(0)
No predictor pairs reach an absolute correlation of 0.75, so no predictors are removed. Next, we create a specific candidate set of models to evaluate:
# specify and store the resampling method
ctrl <- trainControl(method = "cv",
allowParallel = TRUE,
number = 10)
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
set.seed(100)
nnetTune <- train(trainingData$x, trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetTune
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.391555 0.7623243 1.887489
## 0.00 2 2.422221 0.7544913 1.917770
## 0.00 3 2.043540 0.8224555 1.630840
## 0.00 4 2.367168 0.7999596 1.801993
## 0.00 5 2.371240 0.7841913 1.785091
## 0.00 6 2.779869 0.7443282 2.008843
## 0.00 7 3.764096 0.6207519 2.619394
## 0.00 8 6.287055 0.4726517 3.479320
## 0.00 9 4.448435 0.5936014 2.869717
## 0.00 10 3.290886 0.6356008 2.396166
## 0.01 1 2.385379 0.7602941 1.887813
## 0.01 2 2.417335 0.7524249 1.930550
## 0.01 3 2.151211 0.8016017 1.701947
## 0.01 4 2.091926 0.8154380 1.676654
## 0.01 5 2.174252 0.8008679 1.739792
## 0.01 6 2.231823 0.8093149 1.796266
## 0.01 7 2.387448 0.7723920 1.913901
## 0.01 8 2.418258 0.7789421 1.899541
## 0.01 9 2.403486 0.7719540 1.969117
## 0.01 10 2.510406 0.7320214 2.028680
## 0.10 1 2.393969 0.7596425 1.894199
## 0.10 2 2.423111 0.7535683 1.922239
## 0.10 3 2.169933 0.7982368 1.726892
## 0.10 4 2.059078 0.8224162 1.648608
## 0.10 5 1.980727 0.8378612 1.586695
## 0.10 6 2.154588 0.8087362 1.693934
## 0.10 7 2.154261 0.8175103 1.668936
## 0.10 8 2.283792 0.7916330 1.807037
## 0.10 9 2.337881 0.7769658 1.805339
## 0.10 10 2.340156 0.7672626 1.866889
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag = FALSE.
nnetPred <- predict(nnetTune, newdata = testData$x)
nnet.results <- postResample(pred = nnetPred, obs = testData$y)
nnet.results
## RMSE Rsquared MAE
## 2.1328254 0.8228765 1.5911867
MARS MODEL (Multivariate Adaptive Regression Splines)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
marsTuned <- train(trainingData$x, trainingData$y,
method = "earth",
# Explicitly declare the candidate models to test,
tuneGrid = marsGrid,
trControl = ctrl)
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.600474
## 1 3 3.572450 0.4912720 2.895811
## 1 4 2.596841 0.7183600 2.106341
## 1 5 2.370161 0.7659777 1.918669
## 1 6 2.276141 0.7881481 1.810001
## 1 7 1.766728 0.8751831 1.390215
## 1 8 1.780946 0.8723243 1.401345
## 1 9 1.665091 0.8819775 1.325515
## 1 10 1.663804 0.8821283 1.327657
## 1 11 1.657738 0.8822967 1.331730
## 1 12 1.653784 0.8827903 1.331504
## 1 13 1.648496 0.8823663 1.316407
## 1 14 1.639073 0.8841742 1.312833
## 1 15 1.639073 0.8841742 1.312833
## 1 16 1.639073 0.8841742 1.312833
## 1 17 1.639073 0.8841742 1.312833
## 1 18 1.639073 0.8841742 1.312833
## 1 19 1.639073 0.8841742 1.312833
## 1 20 1.639073 0.8841742 1.312833
## 1 21 1.639073 0.8841742 1.312833
## 1 22 1.639073 0.8841742 1.312833
## 1 23 1.639073 0.8841742 1.312833
## 1 24 1.639073 0.8841742 1.312833
## 1 25 1.639073 0.8841742 1.312833
## 1 26 1.639073 0.8841742 1.312833
## 1 27 1.639073 0.8841742 1.312833
## 1 28 1.639073 0.8841742 1.312833
## 1 29 1.639073 0.8841742 1.312833
## 1 30 1.639073 0.8841742 1.312833
## 1 31 1.639073 0.8841742 1.312833
## 1 32 1.639073 0.8841742 1.312833
## 1 33 1.639073 0.8841742 1.312833
## 1 34 1.639073 0.8841742 1.312833
## 1 35 1.639073 0.8841742 1.312833
## 1 36 1.639073 0.8841742 1.312833
## 1 37 1.639073 0.8841742 1.312833
## 1 38 1.639073 0.8841742 1.312833
## 2 2 4.327937 0.2544880 3.600474
## 2 3 3.572450 0.4912720 2.895811
## 2 4 2.661826 0.7070510 2.173471
## 2 5 2.404015 0.7578971 1.975387
## 2 6 2.243927 0.7914805 1.783072
## 2 7 1.856336 0.8605482 1.435682
## 2 8 1.754607 0.8763186 1.396841
## 2 9 1.603578 0.8938666 1.261361
## 2 10 1.492421 0.9084998 1.168700
## 2 11 1.317350 0.9292504 1.033926
## 2 12 1.304327 0.9320133 1.019108
## 2 13 1.277510 0.9323681 1.002927
## 2 14 1.269626 0.9350024 1.003346
## 2 15 1.266217 0.9359400 1.013893
## 2 16 1.268470 0.9354868 1.011414
## 2 17 1.268470 0.9354868 1.011414
## 2 18 1.268470 0.9354868 1.011414
## 2 19 1.268470 0.9354868 1.011414
## 2 20 1.268470 0.9354868 1.011414
## 2 21 1.268470 0.9354868 1.011414
## 2 22 1.268470 0.9354868 1.011414
## 2 23 1.268470 0.9354868 1.011414
## 2 24 1.268470 0.9354868 1.011414
## 2 25 1.268470 0.9354868 1.011414
## 2 26 1.268470 0.9354868 1.011414
## 2 27 1.268470 0.9354868 1.011414
## 2 28 1.268470 0.9354868 1.011414
## 2 29 1.268470 0.9354868 1.011414
## 2 30 1.268470 0.9354868 1.011414
## 2 31 1.268470 0.9354868 1.011414
## 2 32 1.268470 0.9354868 1.011414
## 2 33 1.268470 0.9354868 1.011414
## 2 34 1.268470 0.9354868 1.011414
## 2 35 1.268470 0.9354868 1.011414
## 2 36 1.268470 0.9354868 1.011414
## 2 37 1.268470 0.9354868 1.011414
## 2 38 1.268470 0.9354868 1.011414
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
Viewing the most important variables,
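The chunk producing the table below is not echoed; presumably it is caret's varImp applied to the tuned MARS model, along the lines of:
# scaled variable importance for the tuned MARS model
varImp(marsTuned)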
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
marsPred <- predict(marsTuned, newdata = testData$x)
mars.results <- postResample(pred = marsPred, obs = testData$y)
mars.results
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
SVM (Support Vector Machines)
svmRTuned <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProcess = c("center","scale"),
tuneLength = 14,
trControl = ctrl)
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.490737 0.8009120 1.982118
## 0.50 2.246868 0.8153042 1.774454
## 1.00 2.051872 0.8400992 1.614368
## 2.00 1.949707 0.8534618 1.524201
## 4.00 1.886125 0.8610205 1.465373
## 8.00 1.849240 0.8654699 1.436630
## 16.00 1.834611 0.8673617 1.429808
## 32.00 1.833165 0.8675812 1.428637
## 64.00 1.833165 0.8675812 1.428637
## 128.00 1.833165 0.8675812 1.428637
## 256.00 1.833165 0.8675812 1.428637
## 512.00 1.833165 0.8675812 1.428637
## 1024.00 1.833165 0.8675812 1.428637
## 2048.00 1.833165 0.8675812 1.428637
##
## Tuning parameter 'sigma' was held constant at a value of 0.06315483
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06315483 and C = 32.
svmRPred <- predict(svmRTuned, newdata = testData$x)
svm.results <- postResample(pred = svmRPred, obs = testData$y)
svm.results
## RMSE Rsquared MAE
## 2.0741633 0.8255819 1.5755127
Now we can compare the final results,
rbind("nnet" = nnet.results,
"mars" = mars.results,
"svm" = svm.results,
"knn" = knn.results) |>
kable()
|      |     RMSE |  Rsquared |      MAE |
|:-----|---------:|----------:|---------:|
| nnet | 2.132825 | 0.8228765 | 1.591187 |
| mars | 1.158995 | 0.9460418 | 0.925023 |
| svm  | 2.074163 | 0.8255819 | 1.575513 |
| knn  | 3.204060 | 0.6819919 | 2.568346 |
The MARS model performs best according to the RMSE.
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.73
## X5 15.52
## X3 0.00
The MARS model does select the informative predictors: only X1-X5 appear in the importance list (none of the noise predictors X6-X10 are used), with X1 and X4 ranked highest and X3 contributing the least.
7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Pre-processing
data("ChemicalManufacturingProcess")
cmp <- ChemicalManufacturingProcess
set.seed(100)
trans <- preProcess(cmp,
method = c("BoxCox", "knnImpute", "center", "scale"))
# apply the transformed data
df <- predict(trans, cmp)
df$Yield <- cmp$Yield
# split the data into training and testing sets
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
train.set <- df[trainRows, ]
test.set <- df[-trainRows, ]
# set aside predictors and response
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield
test.x <- test.set |> select(-Yield)
test.y <- test.set$Yield
Training Models
KNN MODEL
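The KNN chunk is not echoed in this section; a sketch of how knnTuned and knn.results (both referenced in part (a) below) might be produced, assuming the same caret setup as in 7.2:
set.seed(100)
knnTuned <- train(train.x, train.y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = ctrl)
knnPred <- predict(knnTuned, test.x)
knn.results <- postResample(knnPred, test.y)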
NEURAL NETWORK MODEL
tooHigh <- findCorrelation(cor(train.x), cutoff = .75)
trainXnnet <- train.x[, -tooHigh]
testXnnet <- test.x[, -tooHigh]
set.seed(100)
nnetGrid <- expand.grid(size = c(1:10),
decay = c(0, 0.01, 0.1),
bag = FALSE)
nnetTuned <- train(trainXnnet, train.y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("center","scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
maxit = 500)
nnetPred <- predict(nnetTuned, testXnnet)
nnet.results <- postResample(nnetPred, test.y)
MARS MODEL
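The MARS and SVM chunks are also not echoed; sketches of how marsTuned, svmRTuned, and their test-set results might be produced, assuming the same grids and tuning settings as in 7.2:
set.seed(100)
marsTuned <- train(train.x, train.y,
                   method = "earth",
                   tuneGrid = expand.grid(.degree = 1:2, .nprune = 2:38),
                   trControl = ctrl)
marsPred <- predict(marsTuned, test.x)
mars.results <- postResample(marsPred, test.y)

set.seed(100)
svmRTuned <- train(train.x, train.y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 14,
                   trControl = ctrl)
svmRPred <- predict(svmRTuned, test.x)
svm.results <- postResample(svmRPred, test.y)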
(a)
Which nonlinear regression model gives the optimal resampling and test set performance?
rbind("nnet" = postResample(predict(nnetTuned), train.y),
"mars" = postResample(predict(marsTuned), train.y),
"svm" = postResample(predict(svmRTuned), train.y),
"knn" = postResample(predict(knnTuned), train.y)) |>
kable(caption = "Resampling Performance")
Table: Resampling Performance

|      |      RMSE |  Rsquared |       MAE |
|:-----|----------:|----------:|----------:|
| nnet | 0.9981769 | 0.7326587 | 0.7407692 |
| mars | 1.0691083 | 0.6765888 | 0.8652505 |
| svm  | 0.1754827 | 0.9931026 | 0.1686926 |
| knn  | 1.1777361 | 0.6453121 | 0.9337500 |
The SVM model has the best resampling performance.
rbind("nnet" = nnet.results,
"mars" = mars.results,
"svm" = svm.results,
"knn" = knn.results) |>
kable(caption = "Test Performance")
Table: Test Performance

|      |     RMSE |  Rsquared |       MAE |
|:-----|---------:|----------:|----------:|
| nnet | 1.247386 | 0.4492970 | 0.9955054 |
| mars | 1.092737 | 0.5656352 | 0.8424734 |
| svm  | 1.066615 | 0.5902318 | 0.8174537 |
| knn  | 1.127135 | 0.5660707 | 0.9067857 |
The SVM model also has the best performance on the test set.
(b)
Which predictors are the most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
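The chunk that computes the importance scores is not echoed; var.imp (reused in part (c) below) is presumably caret's varImp applied to the optimal SVM model, for example:
# importance scores for the tuned SVM model
var.imp <- varImp(svmRTuned)
plot(var.imp, top = 10)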
Among the 10 most important variables, 6 are process and 4 are biological, so the process variables slightly dominate the list. For the optimal linear model, 7 of the top 10 were process and 3 were biological.
(c)
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
# store 10 most important variable names
m.imp <- var.imp$importance |>
data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
row.names()
# correlation matrix of 10 most important variables and response variable
df |>
select(all_of(c("Yield",m.imp))) |>
cor() |>
round(2) |>
corrplot::corrplot(method = "square",
order = "alphabet",
tl.cex = 0.6,
type = "lower")
These plots do reveal some intuition about the relationships with yield: the positive and negative linear relationships between the top predictors and Yield seen in the feature plot correspond to the correlations shown in the matrix plot.
8.1
Recreate the simulated data from Exercise 7.2:
library(mlbench)
set.seed(200)
simulated <- mlbench.friedman1(200, sd=1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
(a)
Fit a random forest model to all the predictors, then estimate the variable importance scores:
model1 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp1 <- varImp(model1, scale=FALSE)
rfImp1 |>
as.data.frame() |>
arrange(desc(Overall)) |>
kable()
|     |    Overall |
|:----|-----------:|
| V1  |  8.8389088 |
| V4  |  7.5882255 |
| V2  |  6.4902306 |
| V5  |  2.2742601 |
| V3  |  0.6758316 |
| V6  |  0.1743678 |
| V7  |  0.1513658 |
| V9  | -0.0298983 |
| V8  | -0.0307894 |
| V10 | -0.0852922 |
Did the random forest model significantly use the uninformative predictors (V6 - V10)?
No; the importance scores for the uninformative predictors V6-V10 are all near zero, so the random forest did not make significant use of them.
(b)
Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
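The chunk that adds the correlated predictor is not echoed; following the textbook example, duplicate1 is presumably created like this:
# add a predictor highly correlated with V1
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)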
## [1] 0.9396216
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
model2 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2 |>
as.data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
kable()
|            |    Overall |
|:-----------|-----------:|
| V4         |  6.9392443 |
| V1         |  6.2978074 |
| V2         |  6.0803813 |
| duplicate1 |  3.5641158 |
| V5         |  2.0310409 |
| V3         |  0.5841072 |
| V6         |  0.0794764 |
| V10        | -0.0071509 |
| V7         | -0.0256641 |
| V9         | -0.0883946 |
The importance of the V1 variable is diminished as the duplicate1 variable takes on some of its importance in the model.
Adding another highly correlated variable,
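Again the chunk is not echoed; duplicate2 is presumably created the same way:
# add a second predictor highly correlated with V1
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)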
## [1] 0.9551123
model3 <- randomForest(y ~ ., data = simulated,
importance = TRUE,
ntree = 1000)
rfImp3 <- varImp(model3, scale = FALSE)
rfImp3 |>
as.data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
kable()
|            |   Overall |
|:-----------|----------:|
| V4         | 7.3149466 |
| V2         | 6.8801700 |
| V1         | 5.2208505 |
| duplicate2 | 2.9393051 |
| duplicate1 | 2.4484915 |
| V5         | 2.3448400 |
| V3         | 0.4148658 |
| V6         | 0.0847761 |
| V10        | 0.0687704 |
| V7         | 0.0477210 |
As another highly correlated variable is added, the importance of the V1 variable to the model is reduced further.
(c)
Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
model4 <- cforest(y ~ ., data = simulated)
# conditional = FALSE
varImp(model4, conditional = FALSE) |>
as.data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
kable()
|            |    Overall |
|:-----------|-----------:|
| V4         |  7.4074927 |
| V2         |  5.9066877 |
| V1         |  5.6833305 |
| duplicate2 |  2.3032633 |
| duplicate1 |  1.8237430 |
| V5         |  1.7047104 |
| V7         |  0.0143123 |
| V6         |  0.0110430 |
| V9         |  0.0037485 |
| V3         | -0.0056921 |
# conditional = TRUE
varImp(model4, conditional = TRUE) |>
as.data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
kable()
|            |    Overall |
|:-----------|-----------:|
| V4         |  5.8087533 |
| V2         |  4.6372690 |
| V1         |  2.3460991 |
| V5         |  1.2106164 |
| duplicate2 |  0.9041890 |
| duplicate1 |  0.5438922 |
| V6         |  0.0169200 |
| V7         |  0.0133140 |
| V10        |  0.0088332 |
| V8         | -0.0022507 |
Setting conditional = FALSE yields variable importances closer to the pattern of the traditional random forest model, although still different. The results from setting conditional = TRUE are very different from the traditional random forest model.
(d)
Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
Boosted Trees
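The boosted trees chunk is not echoed; a sketch of how the relative influence table below might be produced with gbm (the object name, seed, and number of trees are assumptions):
set.seed(100)  # hypothetical seed
gbmSim <- gbm(y ~ ., data = simulated,
              distribution = "gaussian",
              n.trees = 1000)
# relative influence of each predictor, without the bar plot
summary(gbmSim, plotit = FALSE)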
## var rel.inf
## V4 V4 29.689768
## V2 V2 23.365065
## V1 V1 18.025324
## V5 V5 11.580817
## duplicate1 duplicate1 7.670274
## V3 V3 7.434302
## duplicate2 duplicate2 1.803061
## V6 V6 0.431389
## V7 V7 0.000000
## V8 V8 0.000000
## V9 V9 0.000000
## V10 V10 0.000000
The pattern from the boosted trees model differs from the traditional random forest model but is close to that of the modified conditional inference tree model.
Cubist
set.seed(123)
simulated.x <- simulated |> select(-y)
cubistModel <- cubist(simulated.x, simulated$y, committees = 100)
varImp(cubistModel) |>
as.data.frame() |>
arrange(desc(Overall)) |>
head(10) |>
kable()
|            | Overall |
|:-----------|--------:|
| V1         |    62.5 |
| V2         |    62.5 |
| V4         |    46.5 |
| V3         |    43.0 |
| V5         |    29.5 |
| duplicate1 |    10.5 |
| V8         |     5.0 |
| duplicate2 |     4.0 |
| V6         |     3.5 |
| V7         |     0.0 |
Again, the pattern is different: in the Cubist model, V1 (tied with V2) is the most important variable, and V4 drops to third. The duplicate predictors receive relatively little importance, with the noise predictor V8 even outranking duplicate2.
8.2
Use simulation to show tree bias with different granularities.
We'll demonstrate by simulating three predictors with different granularities (different numbers of distinct values), which also gives them different variances.
set.seed(100)
# create predictors
large <- sample(0:1000, 100, replace = TRUE)
medium <- sample(0:100, 100, replace = TRUE)
small <- sample(0:10, 100, replace = TRUE)
# create y value including error term
y <- large + medium + small + rnorm(100)
# create data frame
df <- data.frame(y, large, medium, small)
# fit single regression tree
rpartTree <- rpart(y ~ ., data = df)
varImp(rpartTree) |>
as.data.frame() |>
kable()
|        |   Overall |
|:-------|----------:|
| large  | 3.0999465 |
| medium | 0.5251032 |
| small  | 0.0912745 |
We can see the model assigns the most importance to the predictor with the most distinct values (and the largest variance), illustrating the selection bias of trees toward more granular predictors.
8.3
In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect the magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
(a)
Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?
The bagging fraction is the fraction of the training data sampled to fit each iteration of the trees. With a bagging fraction of 0.1, only 10% of the data is selected at each iteration, so successive trees may be built on very different data and many different predictors end up being important. With a bagging fraction of 0.9, each tree sees 90% of the data, so the trees are built on very similar data and the same few predictors tend to dominate the importance.
Boosting can be susceptible to over-fitting because it selects an optimal weak learner at each iteration. To offset this, regularization (shrinkage) is applied through the learning rate, which is the fraction of the current prediction added to the previous iteration's predictions. As the learning rate increases, a larger fraction of each prediction is added to the model, which is not necessarily a good thing: a learning rate near 1 indicates little or no shrinkage, so the model learns quickly and greedily, tends to have higher error on new data, and concentrates its importance on fewer variables.
(b)
Which model do you think would be more predictive of other samples?
A lower learning rate indicates slower learning and smaller errors, while a lower bagging fraction indicates a smaller proportion of the training data is used to fit each iteration, which limits each tree's exposure to the full range of the data. However, the lower learning rate should provide more accurate generalization to unseen samples, so I believe the model on the left would be more predictive.
(c)
How would increasing interaction depth affect the slope of predictor importance for either model in Fig 8.24?
The interaction depth is the maximum number of splits, or levels, from the root node to the farthest leaf node. A larger depth gives the model the ability to capture more complex interactions between predictors. Let's demonstrate using the solubility data from the figure.
data(solubility)
set.seed(100)
ctrl <- trainControl(method = "cv",
allowParallel = TRUE,
number = 10)
# build left-hand grids/models (shrinkage = 0.1) at interaction depths of 1 and 10
leftGrid1 <- expand.grid(n.trees = 100,
interaction.depth = 1,
shrinkage = 0.1,
n.minobsinnode = 10)
leftModel1 <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = leftGrid1,
trControl = ctrl,
verbose = FALSE)
leftGrid10 <- expand.grid(n.trees = 100,
interaction.depth = 10,
shrinkage = 0.1,
n.minobsinnode = 10)
leftModel10 <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = leftGrid10,
trControl = ctrl,
verbose = FALSE)
# build right grids / models
rightGrid1 <- expand.grid(n.trees = 100,
interaction.depth = 1,
shrinkage = 0.9,
n.minobsinnode = 10)
rightModel1 <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = rightGrid1,
trControl = ctrl,
verbose = FALSE)
rightGrid10 <- expand.grid(n.trees = 100,
interaction.depth = 10,
shrinkage = 0.9,
n.minobsinnode = 10)
rightModel10 <- train(x = solTrainXtrans, y = solTrainY,
method = "gbm",
tuneGrid = rightGrid10,
trControl = ctrl,
verbose = FALSE)
grid.arrange(plot(varImp(leftModel1), top = 20, main = "Left Model: Interaction Depth: 1"),
plot(varImp(leftModel10), top = 20, main = "Left Model: Interaction Depth: 10"),
plot(varImp(rightModel1), top = 20, main = "Right Model: Interaction Depth: 1"),
plot(varImp(rightModel10), top = 20, main = "Right Model: Interaction Depth: 10"),
nrow = 2)
Unfortunately I wasn't able to recreate the bagging component, but we can observe that for the left model, which spreads importance across more predictors, the slope of predictor importance becomes steeper as the interaction depth increases. Conversely, for the right model, which concentrates importance on fewer predictors, the slope decreases as the interaction depth increases.
8.7
Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models.
Pre-processing
data("ChemicalManufacturingProcess")
cmp <- ChemicalManufacturingProcess
set.seed(100)
trans <- preProcess(cmp,
method = c("BoxCox", "knnImpute", "center", "scale"))
# apply the transformed data
df <- predict(trans, cmp)
df$Yield <- cmp$Yield
# split the data into training and testing sets
trainRows <- createDataPartition(df$Yield, p = .80, list = FALSE)
train.set <- df[trainRows, ]
test.set <- df[-trainRows, ]
# set aside predictors and response
train.x <- train.set |> select(-Yield)
train.y <- train.set$Yield
test.x <- test.set |> select(-Yield)
test.y <- test.set$Yield
Single Tree Regression
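The single-tree chunk is not echoed; a sketch of how rpartModel and rpart.results (both used in part (a) below) might be produced with caret:
set.seed(100)
rpartModel <- train(train.x, train.y,
                    method = "rpart",
                    tuneLength = 10,
                    trControl = ctrl)
rpartPred <- predict(rpartModel, test.x)
rpart.results <- postResample(rpartPred, test.y)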
Bagged Trees
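Likewise, a sketch for baggedModel and bagged.results, assuming caret's "treebag" method:
set.seed(100)
baggedModel <- train(train.x, train.y,
                     method = "treebag",
                     trControl = ctrl)
baggedPred <- predict(baggedModel, test.x)
bagged.results <- postResample(baggedPred, test.y)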
Random Forest
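And a sketch for rfModel and rf.results, assuming caret's "rf" method:
set.seed(100)
rfModel <- train(train.x, train.y,
                 method = "rf",
                 importance = TRUE,
                 tuneLength = 10,
                 trControl = ctrl)
rfPred <- predict(rfModel, test.x)
rf.results <- postResample(rfPred, test.y)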
Gradient Boosting
Tuning the gradient boosting model over interaction depth and shrinkage rate, leaving the minimum terminal node size at the default of 10.
set.seed(100)
grid <- expand.grid(n.trees=100,
interaction.depth=c(1, 10),
shrinkage=c(0.1, 0.5),
n.minobsinnode=10)
gbmModel <- train(train.x, train.y,
method = "gbm",
tuneLength = 10,
tuneGrid = grid,
trControl = ctrl,
verbose = FALSE)
gbmPred <- predict(gbmModel, test.x)
gbm.results <- postResample(gbmPred, test.y)
Cubist
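The Cubist chunk is not echoed; a sketch of how cubistModel and cubist.results (both used below) might be fit, assuming caret's "cubist" method:
set.seed(100)
cubistModel <- train(train.x, train.y,
                     method = "cubist",
                     tuneLength = 10,
                     trControl = ctrl)
cubistPred <- predict(cubistModel, test.x)
cubist.results <- postResample(cubistPred, test.y)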
(a)
Which tree based regression model gives the optimal resampling and test set performance?
set.seed(100)
rbind("single regression" = postResample(predict(rpartModel), train.y),
"bagged trees" = postResample(predict(baggedModel), train.y),
"random forest" = postResample(predict(rfModel), train.y),
"gradient boosted" = postResample(predict(gbmModel), train.y),
"cubist" = postResample(predict(cubistModel), train.y)) |>
kable(caption = "Resampling Performance")
Table: Resampling Performance

|                   |      RMSE |  Rsquared |       MAE |
|:------------------|----------:|----------:|----------:|
| single regression | 0.9498059 | 0.7447407 | 0.7709968 |
| bagged trees      | 0.7888974 | 0.8445366 | 0.5899390 |
| random forest     | 0.4507221 | 0.9655947 | 0.3352091 |
| gradient boosted  | 0.3553975 | 0.9692540 | 0.2348292 |
| cubist            | 0.1750849 | 0.9933089 | 0.1336386 |
rbind("single regression" = rpart.results,
"bagged trees" = bagged.results,
"random forest" = rf.results,
"gradient boosted" = gbm.results,
"cubist" = cubist.results) |>
kable(caption = "Testing Performance")
Table: Testing Performance

|                   |      RMSE |  Rsquared |       MAE |
|:------------------|----------:|----------:|----------:|
| single regression | 1.2864914 | 0.4092729 | 1.0280374 |
| bagged trees      | 1.0542203 | 0.6015022 | 0.8411462 |
| random forest     | 0.9898729 | 0.6701921 | 0.8349617 |
| gradient boosted  | 0.8807304 | 0.7170186 | 0.7322906 |
| cubist            | 0.7648249 | 0.7928630 | 0.6056563 |
The cubist model provides the optimal resampling and testing performance.
(b)
Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
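The importance chunk is not echoed; a sketch of how the top predictors of the optimal Cubist model might be viewed (cubist.imp is a hypothetical name):
# importance scores for the tuned Cubist model
cubist.imp <- varImp(cubistModel)
plot(cubist.imp, top = 10)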
In this model, process variables comprise 6 of the top 10 most important predictors, which is equivalent to what we observed in the nonlinear model. In the linear model, process variables comprised 7 of the top 10 most important predictors.
(c)
Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
srt <- rpart(Yield ~ ., data=df)
fancyRpartPlot(srt, palette="GnBu", caption = "Single Regression Tree")
We can observe the following:
- The root node for this tree is ManufacturingProcess32, with the first split occurring where the variable is < 0.22.
- Each child node represents the subset of the data defined by the split criteria above it.
- The tree depth is the number of split levels in the tree. A deeper tree with more splits indicates more complex decision rules and potentially more overfitting to the data. In this chart, we have three splits.
- The leaf nodes at the bottom are the terminal nodes containing the final predicted value of the response for each branch of the tree. We can see that the predicted yield is generally higher when ManufacturingProcess32 is greater than 0.22, and we can observe where the exceptions to this occur.