8.1. Recreate the simulated data from Exercise 7.2:

library(AppliedPredictiveModeling)
library(dplyr)
library(ggplot2)
library(mlbench)
library(lattice)
library(tidyverse)
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  (a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
library(randomForest)
library(caret)
model1 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000)
rfImp1 <- varImp(model1, scale = FALSE)
rfImp1
##         Overall
## V1   8.84289608
## V2   6.74508245
## V3   0.67830653
## V4   7.75934674
## V5   2.23628276
## V6   0.11429887
## V7   0.03724747
## V8  -0.05349642
## V9  -0.04495617
## V10  0.03863205

Did the random forest model significantly use the uninformative predictors (V6 – V10)? No. The importance scores for V6 through V10 are all close to zero (between about -0.05 and 0.11), far below the scores for V1, V2, V4, and V5. Among the informative predictors, V3 has the lowest score (0.68).
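For context on why V6 through V10 count as uninformative: the Friedman 1 benchmark draws ten independent uniform predictors but builds the response from only the first five. The sketch below is a hand-rolled version of that generator, assuming the formula documented for `mlbench.friedman1`; `friedman1_sketch` and `sim` are illustrative names, not part of the original analysis.

```r
# Hand-rolled Friedman 1 generator (assumption: follows the formula
# documented for mlbench.friedman1; the helper name is hypothetical).
friedman1_sketch <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)        # ten independent U(0, 1) predictors
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +        # x1 and x2 enter through an interaction
    20 * (x[, 3] - 0.5)^2 +                    # x3 enters quadratically
    10 * x[, 4] + 5 * x[, 5] +                 # x4 and x5 enter linearly
    rnorm(n, sd = sd)                          # columns 6 to 10 never touch y
  list(x = x, y = y)
}

set.seed(200)
sim <- friedman1_sketch(200)

# Marginal correlation of each predictor with the response
round(cor(sim$x, sim$y)[, 1], 2)
```

Because x3 enters symmetrically around 0.5, its linear correlation with the response can be weak even though it is informative, which may be part of why V3 scores low in the importance table.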

  (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9396216

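Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1? Yes. The importance score for V1 dropped from 8.84 to 6.01, and duplicate1 picked up a score of 3.62, so the importance that V1 carried on its own is now split between the two correlated predictors. The scores of the other predictors shifted only slightly. Adding another predictor that is highly correlated with V1 would be expected to dilute V1's score further.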

model2 <- randomForest(y ~ ., data = simulated, importance = TRUE, ntree = 1000) #create new model with new predictor
rfImp2 <- varImp(model2, scale = FALSE)
rfImp2
##                 Overall
## V1          6.008319352
## V2          6.308908170
## V3          0.571604465
## V4          7.187015958
## V5          2.131040245
## V6          0.211304611
## V7          0.025100355
## V8         -0.116980037
## V9         -0.003679481
## V10         0.024878337
## duplicate1  3.618101735
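The second half of the question, adding one more predictor correlated with V1, can be checked with a sketch along these lines. It rebuilds the data instead of modifying `simulated`; the names `dat`, `duplicate2`, `fit`, and `imp` are illustrative, and the exact scores will vary with the seed.

```r
library(mlbench)
library(randomForest)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
dat <- as.data.frame(sim$x)                     # columns V1 to V10
dat$y <- sim$y

# Two noisy copies of V1 (duplicate2 is a hypothetical second copy)
dat$duplicate1 <- dat$V1 + rnorm(200) * 0.1
dat$duplicate2 <- dat$V1 + rnorm(200) * 0.1

fit <- randomForest(y ~ ., data = dat, importance = TRUE, ntree = 1000)

# Unscaled permutation importance (%IncMSE) for V1 and its two copies
imp <- importance(fit, type = 1, scale = FALSE)
imp[c("V1", "duplicate1", "duplicate2"), , drop = FALSE]
```

If the importance of V1 is being shared among correlated predictors, the three rows should each come out below the 8.84 that V1 scored on its own.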
  (c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model? The patterns are broadly similar: V4 is the most important predictor in both models, V1 and V2 follow, and V6-V10 stay near zero. In the random forest, V2 and V4 sit closest together at the top, while in cforest V1 and V2 are close to each other and V4 stands further above V2. Setting conditional = TRUE lowers the importance of every informative predictor, most of all the correlated pair V1 (6.66 to 3.08) and duplicate1 (2.93 to 1.00), and the uninformative predictors are still not used.
library(party)
model3 <- cforest(y ~ ., data = simulated) #create cforest model
cfImp<- varImp(model3, conditional = FALSE)#not conditional
cfImp
##                Overall
## V1          6.66043731
## V2          6.19942369
## V3          0.04107867
## V4          7.80355110
## V5          1.86181623
## V6         -0.03307225
## V7         -0.02430149
## V8         -0.04881394
## V9         -0.01498777
## V10        -0.06177746
## duplicate1  2.92518849
cfImp2 <- varImp(model3, conditional = TRUE)#conditional
cfImp2
##                 Overall
## V1          3.083108242
## V2          4.795715585
## V3          0.022210074
## V4          6.034177480
## V5          1.046247690
## V6          0.014036296
## V7         -0.005315101
## V8         -0.021781266
## V9         -0.037483560
## V10        -0.015706127
## duplicate1  0.998947469
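The exercise names party's own `varimp()`, while the code above goes through caret's `varImp()`. As a cross-check, the sketch below calls party directly and puts the two measures side by side. It rebuilds the data, and `ntree = 100` and `mtry = 3` are assumptions chosen to keep the run short, not settings from the original analysis.

```r
library(mlbench)
library(party)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
dat <- as.data.frame(sim$x)                     # columns V1 to V10
dat$y <- sim$y
dat$duplicate1 <- dat$V1 + rnorm(200) * 0.1     # noisy copy of V1

# Conditional inference forest (ntree and mtry are illustrative choices)
cf <- cforest(y ~ ., data = dat,
              controls = cforest_unbiased(ntree = 100, mtry = 3))

# Traditional permutation importance vs. the conditional version
vi_marginal    <- party::varimp(cf, conditional = FALSE)
vi_conditional <- party::varimp(cf, conditional = TRUE)

round(cbind(marginal = vi_marginal, conditional = vi_conditional), 3)
```

The conditional measure permutes a predictor within groups defined by its correlated covariates, so the drop is expected to be largest for V1 and duplicate1.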
  (d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur? For Cubist, V1 is the most used predictor, followed by V2 and V4; V6 receives some usage (11.0), while V7-V10 are not used at all. The boosted tree roughly follows the same pattern as cforest and random forest: V4 is the most important predictor and V6-V10 are barely used. Both models here were fit without the duplicate predictor.
#cubist
library(Cubist)
CubistTuned<-train(y ~ ., data = simulated[1:11], method= "cubist") #extracted data without the duplicate predictor.
cbImp<-varImp(CubistTuned$finalModel,scale=FALSE)
cbImp
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0
#boosted tree
library(gbm)
model5 <- gbm(y ~ ., data = simulated[1:11], distribution = "gaussian")
summary(model5)

##     var    rel.inf
## V4   V4 31.1593997
## V1   V1 26.7185707
## V2   V2 22.7065781
## V5   V5 10.2980583
## V3   V3  8.2801207
## V6   V6  0.5851738
## V7   V7  0.1356062
## V8   V8  0.1164925
## V9   V9  0.0000000
## V10 V10  0.0000000

8.2. Use a simulation to show tree bias with different granularities.

I used a Cubist model. The most important variable was A1, the predictor with the highest granularity (the most distinct values), followed by A2 and then A3.

library(rpart)
library(caret)
#creating a data
set.seed(123)
A1<-sample(1:10000, 100,replace=TRUE)
A2<-sample(1:100, 100,replace=TRUE)
A3<-sample(1:10, 100, replace=TRUE)
y<-A1+A2+A3+rnorm(200) # rnorm(200) is twice as long as A1-A3, so they are recycled and Df has 200 rows
Df<-data.frame(Y=y, A1, A2, A3)
#modeling
model6 <- train(Y ~ ., data = Df, method = "cubist")

# variable of importance
cbImp2<-varImp(model6$finalModel,scale=FALSE)
cbImp2
##    Overall
## A2    54.5
## A1    75.0
## A3    12.5
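One caveat about the simulation above: Y is built from A1, A2, and A3, and A1 has by far the widest range, so its high importance reflects real signal as well as any selection bias. A sketch that tries to isolate the bias makes the response pure noise and counts which predictor a single-split rpart tree picks at the root. The settings (`n_sims`, `cp = -1` to force a split, `maxdepth = 1`) are illustrative assumptions.

```r
library(rpart)

set.seed(123)
n_sims <- 200     # number of simulated data sets (illustrative)
n <- 100          # rows per data set
root_var <- character(n_sims)

for (i in seq_len(n_sims)) {
  d <- data.frame(
    y  = rnorm(n),                              # response unrelated to any predictor
    A1 = sample(1:10000, n, replace = TRUE),    # many distinct values
    A2 = sample(1:100,   n, replace = TRUE),    # fewer distinct values
    A3 = sample(1:10,    n, replace = TRUE)     # very few distinct values
  )
  # cp = -1 forces a split even when it barely helps; maxdepth = 1 keeps only the root split
  fit <- rpart(y ~ ., data = d,
               control = rpart.control(maxdepth = 1, cp = -1, xval = 0))
  root_var[i] <- as.character(fit$frame$var[1]) # variable used at the root
}

# How often each predictor wins the root split
table(root_var)
```

Since no predictor carries information here, an unbiased procedure would pick each about equally often. With an exhaustive split search, one would expect A1 to win most often simply because it offers the most candidate split points.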

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  (a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors? The model on the right has a higher bagging fraction and learning rate, so it learns more aggressively with less regularization. It quickly locks onto the strongest predictors, leaving the weaker ones little chance to contribute. With a bagging fraction of 0.9, each tree sees nearly the entire training set, so there is little randomness between trees and the same few strong predictors are selected over and over. The model on the left does the opposite. With a bagging fraction of 0.1, each tree is built on a small random slice of the data, which gives a wider range of predictors a chance to be used. With a learning rate of 0.1, each tree contributes only a small step, so weaker predictors can accumulate importance over many iterations.

  (b) Which model do you think would be more predictive of other samples? The model on the left is more conservative and would likely be more predictive of other samples. Its lower learning rate and bagging fraction help it avoid overfitting and over-reliance on a few predictors (the low bagging fraction adds diversity across trees), whereas the model on the right, with both parameters at 0.9, is more likely to overfit.

  (c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24? A steeper slope means a few predictors dominate, and a flatter slope means importance is spread more evenly. For the model on the right, increasing the interaction depth would likely steepen the slope: deeper trees would model the relationships among the few top predictors in more detail, so the model would lean on them even more. For the model on the left, deeper trees would interact with more varied subsets of the data, which might flatten the slope slightly, depending on how many weaker predictors end up contributing.
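The effect of the learning rate in part (a) can be made concrete with some idealized arithmetic. The sketch assumes, unrealistically, that every tree fits the current residual perfectly, so a fraction of one minus the learning rate is left over after each iteration. `signal_left` is a hypothetical helper, not a gbm function.

```r
# Idealized assumption: each tree fits the current residual perfectly and
# its prediction is shrunk by the learning rate before being added.
signal_left <- function(rate, n_trees) (1 - rate)^n_trees

rates <- c(left_plot = 0.1, right_plot = 0.9)

# Fraction of the original signal still unexplained after 1 tree and after 5 trees
after_1 <- sapply(rates, signal_left, n_trees = 1)
after_5 <- sapply(rates, signal_left, n_trees = 5)

round(rbind(after_1, after_5), 4)
```

Under this assumption, a learning rate of 0.9 leaves only 10% of the signal after the first tree, so whichever predictors that tree splits on collect most of the credit. A learning rate of 0.1 leaves 90%, so many later trees, each grown on a different subsample, still have room to bring in other predictors.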

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models: (a) Which tree-based regression model gives the optimal resampling and test set performance? The Cubist model performed best on both RMSE and R-squared.

data(ChemicalManufacturingProcess)
set.seed(200)

Prepro<-preProcess(ChemicalManufacturingProcess, method= "knnImpute")
ProcessPredictorImputed<-predict(Prepro, ChemicalManufacturingProcess)
ProcessPredictorImputed<-as.data.frame(ProcessPredictorImputed)

ProcessPredictor<-select(ProcessPredictorImputed, -Yield)

Yield_df<-ProcessPredictorImputed$Yield

ProcessPredictorf<-ProcessPredictor[, -nearZeroVar(ProcessPredictor)]

trainIndex2<-createDataPartition(Yield_df, p=0.8, list = FALSE)
train_data2<- ProcessPredictorf[trainIndex2,]
train_response2<-Yield_df[trainIndex2]
test_data2<-ProcessPredictorf[-trainIndex2,]
test_reponse2<-Yield_df[-trainIndex2]

ctrl<-trainControl(method = "cv", number=10)

set.seed(200)
tree_model<-train(train_data2, train_response2, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
cb_model<-train(train_data2, train_response2, method = "cubist", trControl = ctrl)
rf_model<-train(train_data2, train_response2, method = "rf", trControl = ctrl)
gbm_model<-train(train_data2, train_response2, method = "gbm", trControl = ctrl)
##      7        0.5549             nan     0.1000    0.0208
##      8        0.5244             nan     0.1000    0.0211
##      9        0.4920             nan     0.1000    0.0274
##     10        0.4688             nan     0.1000    0.0120
##     20        0.3235             nan     0.1000   -0.0042
##     40        0.2113             nan     0.1000   -0.0022
##     60        0.1528             nan     0.1000   -0.0018
##     80        0.1117             nan     0.1000   -0.0011
##    100        0.0883             nan     0.1000   -0.0004
##    120        0.0696             nan     0.1000   -0.0017
##    140        0.0569             nan     0.1000   -0.0014
##    150        0.0513             nan     0.1000   -0.0005
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8795             nan     0.1000    0.0896
##      2        0.7964             nan     0.1000    0.0737
##      3        0.7297             nan     0.1000    0.0653
##      4        0.6653             nan     0.1000    0.0646
##      5        0.6186             nan     0.1000    0.0300
##      6        0.5641             nan     0.1000    0.0437
##      7        0.5183             nan     0.1000    0.0313
##      8        0.4826             nan     0.1000    0.0232
##      9        0.4498             nan     0.1000    0.0157
##     10        0.4248             nan     0.1000    0.0228
##     20        0.2773             nan     0.1000    0.0013
##     40        0.1583             nan     0.1000    0.0008
##     60        0.1026             nan     0.1000   -0.0015
##     80        0.0716             nan     0.1000   -0.0026
##    100        0.0495             nan     0.1000   -0.0013
##    120        0.0360             nan     0.1000   -0.0009
##    140        0.0267             nan     0.1000   -0.0001
##    150        0.0233             nan     0.1000   -0.0007
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8637             nan     0.1000    0.0580
##      2        0.7981             nan     0.1000    0.0526
##      3        0.7418             nan     0.1000    0.0573
##      4        0.7005             nan     0.1000    0.0364
##      5        0.6657             nan     0.1000    0.0231
##      6        0.6328             nan     0.1000    0.0189
##      7        0.6049             nan     0.1000    0.0267
##      8        0.5750             nan     0.1000    0.0232
##      9        0.5493             nan     0.1000    0.0178
##     10        0.5315             nan     0.1000    0.0134
##     20        0.3896             nan     0.1000    0.0005
##     40        0.2900             nan     0.1000   -0.0038
##     60        0.2395             nan     0.1000   -0.0031
##     80        0.2052             nan     0.1000   -0.0020
##    100        0.1813             nan     0.1000   -0.0025
##    120        0.1612             nan     0.1000   -0.0027
##    140        0.1440             nan     0.1000   -0.0010
##    150        0.1377             nan     0.1000   -0.0020
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8369             nan     0.1000    0.0757
##      2        0.7617             nan     0.1000    0.0560
##      3        0.7189             nan     0.1000    0.0097
##      4        0.6450             nan     0.1000    0.0670
##      5        0.6167             nan     0.1000    0.0189
##      6        0.5739             nan     0.1000    0.0334
##      7        0.5398             nan     0.1000    0.0229
##      8        0.5065             nan     0.1000    0.0284
##      9        0.4866             nan     0.1000    0.0061
##     10        0.4621             nan     0.1000    0.0138
##     20        0.3132             nan     0.1000    0.0063
##     40        0.1966             nan     0.1000   -0.0027
##     60        0.1453             nan     0.1000   -0.0024
##     80        0.1079             nan     0.1000   -0.0028
##    100        0.0856             nan     0.1000   -0.0008
##    120        0.0710             nan     0.1000   -0.0007
##    140        0.0559             nan     0.1000   -0.0015
##    150        0.0504             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8330             nan     0.1000    0.1072
##      2        0.7492             nan     0.1000    0.0594
##      3        0.6782             nan     0.1000    0.0663
##      4        0.6164             nan     0.1000    0.0387
##      5        0.5664             nan     0.1000    0.0223
##      6        0.5242             nan     0.1000    0.0396
##      7        0.4701             nan     0.1000    0.0272
##      8        0.4376             nan     0.1000    0.0188
##      9        0.4210             nan     0.1000    0.0087
##     10        0.3991             nan     0.1000    0.0034
##     20        0.2732             nan     0.1000   -0.0043
##     40        0.1600             nan     0.1000   -0.0026
##     60        0.1012             nan     0.1000   -0.0017
##     80        0.0700             nan     0.1000   -0.0007
##    100        0.0489             nan     0.1000   -0.0007
##    120        0.0355             nan     0.1000   -0.0005
##    140        0.0275             nan     0.1000   -0.0008
##    150        0.0243             nan     0.1000   -0.0005
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8899             nan     0.1000    0.0792
##      2        0.8289             nan     0.1000    0.0456
##      3        0.7741             nan     0.1000    0.0447
##      4        0.7326             nan     0.1000    0.0318
##      5        0.6972             nan     0.1000    0.0291
##      6        0.6690             nan     0.1000    0.0329
##      7        0.6323             nan     0.1000    0.0362
##      8        0.6003             nan     0.1000    0.0218
##      9        0.5819             nan     0.1000    0.0112
##     10        0.5541             nan     0.1000    0.0158
##     20        0.3952             nan     0.1000    0.0064
##     40        0.2810             nan     0.1000   -0.0023
##     60        0.2332             nan     0.1000    0.0015
##     80        0.1936             nan     0.1000   -0.0029
##    100        0.1676             nan     0.1000   -0.0022
##    120        0.1442             nan     0.1000   -0.0010
##    140        0.1263             nan     0.1000   -0.0009
##    150        0.1202             nan     0.1000   -0.0022
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8853             nan     0.1000    0.0746
##      2        0.7737             nan     0.1000    0.0974
##      3        0.7254             nan     0.1000    0.0430
##      4        0.6653             nan     0.1000    0.0543
##      5        0.6152             nan     0.1000    0.0365
##      6        0.5743             nan     0.1000    0.0354
##      7        0.5358             nan     0.1000    0.0272
##      8        0.5011             nan     0.1000    0.0259
##      9        0.4744             nan     0.1000    0.0211
##     10        0.4549             nan     0.1000    0.0073
##     20        0.3021             nan     0.1000    0.0040
##     40        0.1938             nan     0.1000   -0.0018
##     60        0.1399             nan     0.1000    0.0006
##     80        0.1046             nan     0.1000   -0.0018
##    100        0.0833             nan     0.1000   -0.0016
##    120        0.0684             nan     0.1000   -0.0020
##    140        0.0557             nan     0.1000   -0.0004
##    150        0.0507             nan     0.1000   -0.0004
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8562             nan     0.1000    0.1088
##      2        0.7579             nan     0.1000    0.0794
##      3        0.6772             nan     0.1000    0.0660
##      4        0.6215             nan     0.1000    0.0325
##      5        0.5651             nan     0.1000    0.0422
##      6        0.5230             nan     0.1000    0.0194
##      7        0.4915             nan     0.1000    0.0277
##      8        0.4518             nan     0.1000    0.0278
##      9        0.4232             nan     0.1000    0.0222
##     10        0.3991             nan     0.1000    0.0149
##     20        0.2502             nan     0.1000    0.0005
##     40        0.1397             nan     0.1000    0.0022
##     60        0.0949             nan     0.1000    0.0002
##     80        0.0667             nan     0.1000   -0.0019
##    100        0.0465             nan     0.1000   -0.0010
##    120        0.0323             nan     0.1000   -0.0005
##    140        0.0233             nan     0.1000   -0.0001
##    150        0.0199             nan     0.1000   -0.0003
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8904             nan     0.1000    0.0781
##      2        0.8234             nan     0.1000    0.0493
##      3        0.7741             nan     0.1000    0.0439
##      4        0.7179             nan     0.1000    0.0416
##      5        0.6803             nan     0.1000    0.0267
##      6        0.6465             nan     0.1000    0.0293
##      7        0.6135             nan     0.1000    0.0145
##      8        0.5915             nan     0.1000    0.0086
##      9        0.5693             nan     0.1000    0.0162
##     10        0.5474             nan     0.1000    0.0132
##     20        0.3977             nan     0.1000    0.0029
##     40        0.2929             nan     0.1000   -0.0024
##     60        0.2421             nan     0.1000    0.0002
##     80        0.2053             nan     0.1000   -0.0013
##    100        0.1788             nan     0.1000   -0.0014
##    120        0.1612             nan     0.1000   -0.0015
##    140        0.1437             nan     0.1000   -0.0003
##    150        0.1373             nan     0.1000   -0.0018
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8883             nan     0.1000    0.0863
##      2        0.8034             nan     0.1000    0.0664
##      3        0.7316             nan     0.1000    0.0532
##      4        0.6728             nan     0.1000    0.0412
##      5        0.6205             nan     0.1000    0.0219
##      6        0.5785             nan     0.1000    0.0314
##      7        0.5524             nan     0.1000    0.0103
##      8        0.5225             nan     0.1000    0.0176
##      9        0.5045             nan     0.1000    0.0039
##     10        0.4752             nan     0.1000    0.0230
##     20        0.3221             nan     0.1000    0.0020
##     40        0.2045             nan     0.1000   -0.0021
##     60        0.1431             nan     0.1000   -0.0045
##     80        0.1084             nan     0.1000   -0.0018
##    100        0.0833             nan     0.1000   -0.0010
##    120        0.0670             nan     0.1000   -0.0014
##    140        0.0544             nan     0.1000   -0.0009
##    150        0.0490             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8653             nan     0.1000    0.0995
##      2        0.7762             nan     0.1000    0.0818
##      3        0.7033             nan     0.1000    0.0608
##      4        0.6457             nan     0.1000    0.0475
##      5        0.5995             nan     0.1000    0.0376
##      6        0.5538             nan     0.1000    0.0404
##      7        0.5127             nan     0.1000    0.0316
##      8        0.4852             nan     0.1000    0.0164
##      9        0.4580             nan     0.1000    0.0112
##     10        0.4266             nan     0.1000    0.0180
##     20        0.2560             nan     0.1000    0.0025
##     40        0.1544             nan     0.1000   -0.0029
##     60        0.0956             nan     0.1000    0.0011
##     80        0.0609             nan     0.1000   -0.0015
##    100        0.0411             nan     0.1000   -0.0006
##    120        0.0283             nan     0.1000   -0.0006
##    140        0.0204             nan     0.1000   -0.0006
##    150        0.0175             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8715             nan     0.1000    0.0525
##      2        0.7927             nan     0.1000    0.0631
##      3        0.7058             nan     0.1000    0.0831
##      4        0.6558             nan     0.1000    0.0246
##      5        0.6036             nan     0.1000    0.0463
##      6        0.5597             nan     0.1000    0.0359
##      7        0.5093             nan     0.1000    0.0242
##      8        0.4770             nan     0.1000    0.0282
##      9        0.4443             nan     0.1000    0.0225
##     10        0.4272             nan     0.1000    0.0093
##     20        0.2704             nan     0.1000   -0.0006
##     40        0.1654             nan     0.1000   -0.0022
##     60        0.1071             nan     0.1000   -0.0023
##     80        0.0728             nan     0.1000   -0.0017
##    100        0.0520             nan     0.1000   -0.0013
##    120        0.0376             nan     0.1000   -0.0006
##    140        0.0287             nan     0.1000   -0.0009
##    150        0.0240             nan     0.1000   -0.0005
# Extract RMSE and R-squared for the best (lowest-RMSE) tuning setting
extract_metrics <- function(model) {
  best_result <- model$results[which.min(model$results$RMSE), ]
  return(c(RMSE = best_result$RMSE, Rsquared = best_result$Rsquared))
}

# Combine the best metrics from all four models into one table
Trees <- data.frame(
  Model = c("Decision Tree", "Cubist", "Random Forest", "Gradient Boosting"),
  t(sapply(list(tree_model, cb_model, rf_model, gbm_model), extract_metrics))
)

print(Trees)
##               Model      RMSE  Rsquared
## 1     Decision Tree 0.6599408 0.5513568
## 2            Cubist 0.5222891 0.7345227
## 3     Random Forest 0.5645715 0.7080794
## 4 Gradient Boosting 0.5716477 0.6663822
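The helper above relies on caret storing one row per tuning-parameter combination in `model$results`. As a minimal sketch of that selection logic, the block below applies the same helper to a hand-made stand-in object. `mock_model` and all of its numbers are hypothetical, chosen only so the example runs without fitting a model.

```r
# Same helper as above: pick the resampling row with the lowest RMSE
extract_metrics <- function(model) {
  best_result <- model$results[which.min(model$results$RMSE), ]
  c(RMSE = best_result$RMSE, Rsquared = best_result$Rsquared)
}

# Hypothetical stand-in for a caret train object; the values are made up
mock_model <- list(
  results = data.frame(
    cp       = c(0.01, 0.05, 0.10),
    RMSE     = c(0.70, 0.66, 0.72),
    Rsquared = c(0.50, 0.55, 0.48)
  )
)

# Returns the metrics from the second row, which has the smallest RMSE
metrics <- extract_metrics(mock_model)
print(metrics)
```

The row is chosen by RMSE alone, so the reported R-squared is the one paired with that row and is not necessarily the largest R-squared in the tuning grid.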
  1. Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models? The optimal tree-based model is Cubist, which has the lowest RMSE (0.522) and the highest R-squared (0.735) in the table above. Its top ten predictors are ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess01, ManufacturingProcess29, ManufacturingProcess30, BiologicalMaterial06, ManufacturingProcess13, ManufacturingProcess09, BiologicalMaterial10, and ManufacturingProcess33. Process variables dominate the list, accounting for eight of the ten. Process variables also dominate in the optimal linear and nonlinear models, and ManufacturingProcess32 is the most important predictor in all three. BiologicalMaterial06 ranks second in the nonlinear model but is less important in the linear and tree-based models.
cb_varImp <- varImp(cb_model)
top10_cb <- cb_varImp$importance |>
  rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice(1:10)
print(top10_cb)
##                 Predictor   Overall
## 1  ManufacturingProcess32 100.00000
## 2  ManufacturingProcess17  76.47059
## 3  ManufacturingProcess01  41.17647
## 4  ManufacturingProcess29  27.73109
## 5  ManufacturingProcess30  26.05042
## 6    BiologicalMaterial06  26.05042
## 7  ManufacturingProcess13  25.21008
## 8  ManufacturingProcess09  23.52941
## 9    BiologicalMaterial10  19.32773
## 10 ManufacturingProcess33  19.32773
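To check which group of predictors dominates, the names can be tallied by prefix. This is a small sketch that copies the ten predictor names from the output above into a vector by hand. The names `top10`, `predictor_type`, and `type_counts` are illustrative and not part of the original analysis.

```r
# Predictor names copied from the top-10 Cubist importance output above
top10 <- c(
  "ManufacturingProcess32", "ManufacturingProcess17", "ManufacturingProcess01",
  "ManufacturingProcess29", "ManufacturingProcess30", "BiologicalMaterial06",
  "ManufacturingProcess13", "ManufacturingProcess09", "BiologicalMaterial10",
  "ManufacturingProcess33"
)

# Classify each predictor by its name prefix, then count each group
predictor_type <- ifelse(startsWith(top10, "ManufacturingProcess"),
                         "Process", "Biological")
type_counts <- table(predictor_type)
print(type_counts)
```

With the fitted model available, the same tally could presumably be run on `top10_cb$Predictor` instead of a hand-copied vector.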
  1. Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield? The tree plot features more biological variables than the Cubist top-10 list does, and its split structure shows how the top predictors relate to one another and to yield.
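# Combine the predictors and the response into one training data frame
train <- train_data2
train$Yield <- train_response2
library(rpart)
library(rpart.plot)
# Fit a single regression tree with the rpart defaults and plot it
rpart_tree <- rpart(Yield ~ ., data = train)
rpart.plot(rpart_tree)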
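The exercise asks for the distribution of yield in the terminal nodes, whereas the plot above summarizes each node with a fitted value. One possible way to see the spread within each leaf is to group the response by the leaf each training row falls into, using the `where` component of the fitted `rpart` object. The sketch below uses the built-in `mtcars` data as a stand-in so that it runs on its own; that substitution is an assumption, and with the exercise data `rpart_tree` and `train$Yield` would take the place of `fit` and `mtcars$mpg`.

```r
library(rpart)

# Stand-in data: mtcars replaces the chemical manufacturing training set
fit <- rpart(mpg ~ ., data = mtcars)

# fit$where gives, for each training row, the row of fit$frame for its leaf
leaf_id <- factor(fit$where)

# One box per terminal node shows the spread of the response within it
boxplot(mtcars$mpg ~ leaf_id,
        xlab = "Terminal node (row in fit$frame)",
        ylab = "mpg")

# Mean response per terminal node, for comparison with the plotted tree
leaf_means <- tapply(mtcars$mpg, leaf_id, mean)
print(leaf_means)
```

Wide or overlapping boxes would suggest that a leaf's fitted value hides considerable variation in the response.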