Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the Rpubs link along with the .rmd file.
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
##
## The following object is masked from 'package:stringr':
##
## boundary
##
##
## Attaching package: 'party'
##
## The following object is masked from 'package:dplyr':
##
## where
library(partykit)
## Loading required package: libcoin
##
## Attaching package: 'partykit'
##
## The following objects are masked from 'package:party':
##
## cforest, ctree, ctree_control, edge_simple, mob, mob_control,
## node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
## node_terminal, varimp
library(AppliedPredictiveModeling)
library(Cubist)
library(rpart)
library(gbm)
## Loaded gbm 2.1.8.1
library(ipred)
8.1. Recreate the simulated data from Exercise 7.2:
set.seed(200)
simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
set.seed(123)
randfor_sim <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
varImp(randfor_sim, scale = FALSE)
## Overall
## V1 8.864171202
## V2 6.657873300
## V3 0.841086266
## V4 7.693792783
## V5 2.127834542
## V6 0.151948926
## V7 0.098320233
## V8 0.001378668
## V9 -0.038870929
## V10 0.014593264
Did the random forest model significantly use the uninformative predictors (V6 – V10)?
No. The importance scores for V6 - V10 are all essentially zero, far below those of the informative predictors, so the random forest made no meaningful use of them when building the trees. V6 has the largest score of the group but is still negligible relative to V1 - V5.
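As a quick visual check (a sketch using randomForest's built-in importance plot on the randfor_sim object fit above):
# Permutation importance (%IncMSE): V6-V10 should cluster near zero while
# V1, V2, and V4 stand out.
varImpPlot(randfor_sim, type = 1, main = "Permutation importance, simulated data")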
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9496502
The correlation between the duplicate predictor and V1 is nearly perfect at about 0.95, so duplicate1 carries most of the information in the V1 column.
Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?
randfor_simcorr <- randomForest(y ~.,
data = simulated,
importance = TRUE,
ntree = 1000)
varImp(randfor_simcorr, scale = FALSE)
## Overall
## V1 5.84650034
## V2 5.96164114
## V3 0.73333384
## V4 6.62213280
## V5 2.13591969
## V6 0.09174444
## V7 -0.03849802
## V8 -0.10724130
## V9 -0.08747650
## V10 -0.06821814
## duplicate1 3.90639448
Yes, the importance scores change dramatically. As Strobl et al. suggest, adding a highly correlated predictor shifts the scores of both the significant and insignificant predictors. As expected, the previously most influential predictors, especially V1, lose importance because the model now attributes part of their contribution to the new correlated column.
simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9411269
randfor_simcorr2 <- randomForest(y ~ .,
data = simulated,
importance = TRUE,
ntree = 1000)
varImp(randfor_simcorr2, scale = FALSE)
## Overall
## V1 4.74077847
## V2 6.16895867
## V3 0.42471658
## V4 7.14574311
## V5 2.07339187
## V6 0.12708648
## V7 0.02039238
## V8 -0.12199162
## V9 -0.05595316
## V10 0.05450384
## duplicate1 2.55791753
## duplicate2 2.90071515
As expected, V1's importance decreases further because the model now splits the credit across all three highly correlated variables. This dilution would continue as more predictors highly correlated with V1 are added.
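A quick hedged check of that claim (a sketch; sim_extra, duplicate3, and randfor_simcorr3 are hypothetical names, and the exact scores will differ run to run):
# Add a hypothetical third noisy copy of V1 to a copy of the data and refit;
# V1's importance should be diluted across all four correlated columns.
sim_extra <- simulated
sim_extra$duplicate3 <- sim_extra$V1 + rnorm(200) * .1
randfor_simcorr3 <- randomForest(y ~ ., data = sim_extra, importance = TRUE, ntree = 1000)
varImp(randfor_simcorr3, scale = FALSE)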
cf_rf <- cforest(y ~., data = simulated[,c(1:11)])
varimp(cf_rf)
## V1 V2 V3 V4 V5 V6
## 8.17055263 6.06418458 0.21387594 7.28763406 2.40274870 -0.02536830
## V7 V8 V9 V10
## 0.05517648 0.05512989 -0.09218832 -0.16158081
varimp(cf_rf,conditional=TRUE)
## V1 V2 V3 V4 V5 V6 V7
## 6.4199339 5.1798554 0.1693474 6.1651422 1.3470355 -0.3379530 -0.1054344
## V8 V9 V10
## -0.3929227 -0.3170242 -0.2274853
Both the conditional and traditional importance measures identify V1 as the most important predictor. The traditional measure more closely mirrors the randomForest results, whereas the conditional variant treats V1 and V4 as more comparable predictors. Overall, both approaches rank the variables similarly to the other random forest implementation.
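To compare the two measures more directly, they can be placed side by side (a small sketch reusing the fitted cf_rf object; imp_compare is a hypothetical name, and varimp() is stochastic, so the values will wiggle between runs):
# Unconditional vs conditional permutation importance, sorted by the
# unconditional measure; only the relative ordering matters here.
imp_compare <- data.frame(unconditional = varimp(cf_rf),
                          conditional = varimp(cf_rf, conditional = TRUE))
imp_compare[order(-imp_compare$unconditional), ]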
gbmTr <- gbm(y ~ ., data = simulated[,c(1:11)], distribution = "gaussian")
summary(gbmTr)
## var rel.inf
## V4 V4 29.6208855
## V1 V1 26.2629306
## V2 V2 23.6945298
## V5 V5 12.0208998
## V3 V3 7.9640377
## V6 V6 0.2451056
## V7 V7 0.1916109
## V8 V8 0.0000000
## V9 V9 0.0000000
## V10 V10 0.0000000
The gradient boosted model ranks V4 ahead of V1 and V2, a slightly different ordering than the random forest importance rankings, and it treats V2 as more comparable to V1. Like the other models, though, it concentrates importance on V1 - V5 and assigns essentially no influence to V6 - V10.
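Note that gbm() was run with its package defaults (a small number of shallow trees), so the ranking above can be noisy. A hedged sketch with larger, still arbitrary, settings (gbmTr2 is a hypothetical name; the parameter values are illustrative, not tuned) usually gives a more stable picture:
# Refit with more, deeper trees and a smaller learning rate; the relative
# influence of V6-V10 should stay near zero while V1-V5 dominate.
set.seed(85)
gbmTr2 <- gbm(y ~ ., data = simulated[, c(1:11)], distribution = "gaussian",
              n.trees = 1000, shrinkage = 0.01, interaction.depth = 3)
summary(gbmTr2, plotit = FALSE)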
set.seed(85)
cubistTr <- train(y ~ .,
data = simulated[, c(1:11)],
method = "cubist")
varImp(cubistTr$finalModel, scale = FALSE)
## Overall
## V1 72.0
## V3 42.0
## V2 54.5
## V4 49.0
## V5 40.0
## V6 11.0
## V7 0.0
## V8 0.0
## V9 0.0
## V10 0.0
The Cubist model identifies V1 as the strongest predictor by a considerable margin over the rest of the variables. V4 falls more in line with the other informative predictors, and the model follows the same pattern as the others in treating V6 - V10 as unimportant.
8.2. Use a simulation to show tree bias with different granularities.
The textbook states the following in regard to granularities for a single regression tree:
“Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that “The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.”
Based on that prior research, predictors with more distinct values should appear higher in the tree than the more granular predictors, even when they are no more informative:
set.seed(21)
one <- sample(1:100 / 100, 200, replace = TRUE)
two <- sample(1:750 / 750, 200, replace = TRUE)
three <- sample(1:10000/ 10000, 200, replace = TRUE)
y_sim <- one + two + three + rnorm(200,mean=0,sd=5)
df_sim <- data.frame(one,two,three,y_sim)
rt_Sim <- rpart(y_sim ~ ., data = df_sim)
plot(as.party(rt_Sim))
varImp(rt_Sim)
## Overall
## one 0.8019093
## three 0.6986667
## two 0.8775057
As expected, the top splits in the tree all use three, the column with the most distinct values. Interestingly, despite occupying the top nodes, three still has the lowest variable importance of the three predictors.
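Because a single tree is a noisy demonstration, a sketch that repeats the simulation and tallies which variable rpart chooses for the root split shows the bias more directly (root_split is a hypothetical helper; the counts will vary with the seed):
# Tally the root-split variable over repeated simulations. Under selection
# bias, 'three' (most distinct values) should win the root disproportionately
# often even though all three predictors carry equally weak signal.
root_split <- function() {
  one   <- sample(1:100 / 100, 200, replace = TRUE)
  two   <- sample(1:750 / 750, 200, replace = TRUE)
  three <- sample(1:10000 / 10000, 200, replace = TRUE)
  y_sim <- one + two + three + rnorm(200, mean = 0, sd = 5)
  fit <- rpart(y_sim ~ ., data = data.frame(one, two, three, y_sim))
  as.character(fit$frame$var[1])   # "<leaf>" if no split survives
}
set.seed(21)
table(replicate(100, root_split()))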
8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:
The high bagging fraction and learning rate cause the model to take large steps along the gradient and to use most of the training data for every tree, so it converges quickly, possibly to a local rather than a global minimum, and sees less variety across the random samples. Because each of those fewer effective iterations uses such a large sample, the model is more likely to capture variance specific to a few predictors, which makes them appear dominant in the importance plot.
The expectation is that the model on the left would better predict new samples: the much smaller learning rate, applied to smaller random subsets, means it learns gradually from many different cuts of the data, although this is considerably more computationally expensive.
Boosting uses shallow trees (weak learners) by default; increasing the interaction depth lets each tree draw on more predictors, which would raise the importance of otherwise less significant variables. The importance slope would therefore likely flatten for both models as additional predictors gain influence through their inclusion in more trees.
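Figure 8.24 was built on the solubility data; as a lighter-weight illustration of the same effect, the two extreme settings can be fit to the simulated data from 8.1 and their influence profiles compared (a sketch, not the textbook's exact setup; gbm_low and gbm_high are hypothetical names and n.trees = 500 is arbitrary):
# Conservative (0.1/0.1) vs aggressive (0.9/0.9) shrinkage and bagging fraction.
# The aggressive fit tends to concentrate relative influence on fewer predictors;
# interaction.depth could be varied the same way to examine the depth question above.
set.seed(85)
gbm_low  <- gbm(y ~ ., data = simulated[, c(1:11)], distribution = "gaussian",
                n.trees = 500, shrinkage = 0.1, bag.fraction = 0.1)
gbm_high <- gbm(y ~ ., data = simulated[, c(1:11)], distribution = "gaussian",
                n.trees = 500, shrinkage = 0.9, bag.fraction = 0.9)
head(summary(gbm_low, plotit = FALSE))
head(summary(gbm_high, plotit = FALSE))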
8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:
data("ChemicalManufacturingProcess")
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
x_chem <- ChemicalManufacturingProcess |> select(-Yield)
y_chem <- ChemicalManufacturingProcess |> select(Yield)
x_chem |>
  pivot_longer(cols = colnames(x_chem)) |>
  filter(is.na(value)) |>
  group_by(name) |>
  summarise(cnt = n()) |>
  arrange(-cnt)
## # A tibble: 28 Ă— 2
## name cnt
## <chr> <int>
## 1 ManufacturingProcess03 15
## 2 ManufacturingProcess11 10
## 3 ManufacturingProcess10 9
## 4 ManufacturingProcess25 5
## 5 ManufacturingProcess26 5
## 6 ManufacturingProcess27 5
## 7 ManufacturingProcess28 5
## 8 ManufacturingProcess29 5
## 9 ManufacturingProcess30 5
## 10 ManufacturingProcess31 5
## # ℹ 18 more rows
preProcess offers a few straightforward imputation options; to preserve the overall structure of the predictors, the K-nearest neighbors method will be used.
Let’s apply KNN to find the nearest neighbors for imputation:
imputed <- preProcess(x_chem,method=c('center','scale','knnImpute'),k=5)
x_chem_imp <- predict(imputed,x_chem)
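A quick sanity check (a small sketch) that the imputation filled every missing value and the centering worked as intended:
# No NAs should remain after knnImpute, and centering puts column means near 0.
sum(is.na(x_chem_imp))
range(colMeans(x_chem_imp))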
set.seed(21)
chem_train <- caret::createDataPartition(y_chem$Yield, p = .70, list= FALSE)
x_train_chem <- x_chem_imp[chem_train,]
x_test_chem <- x_chem_imp[-chem_train,]
y_train_chem <- y_chem[chem_train,]
y_test_chem <- y_chem[-chem_train,]
set.seed(45)
rpartTune <- train(x_train_chem, y_train_chem,
method = "rpart2",
tuneLength = 10,
trControl = trainControl(method = "cv"))
## note: only 9 possible values of the max tree depth from the initial fit.
## Truncating the grid to 9 .
rpartPred <- predict(rpartTune,x_test_chem)
baggedTree <- ipredbagg(y_train_chem,x_train_chem)
baggedPred <- predict(baggedTree,x_test_chem)
rfTree <- randomForest(x_train_chem, y_train_chem)
rfPred <- predict(rfTree,x_test_chem)
gbmModel <- gbm.fit(x_train_chem, y_train_chem, distribution = "gaussian")
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 3.5279 nan 0.0010 0.0024
## 2 3.5251 nan 0.0010 0.0024
## 3 3.5228 nan 0.0010 0.0030
## 4 3.5200 nan 0.0010 0.0028
## 5 3.5174 nan 0.0010 0.0025
## 6 3.5145 nan 0.0010 0.0027
## 7 3.5115 nan 0.0010 0.0025
## 8 3.5088 nan 0.0010 0.0024
## 9 3.5058 nan 0.0010 0.0023
## 10 3.5031 nan 0.0010 0.0028
## 20 3.4753 nan 0.0010 0.0027
## 40 3.4205 nan 0.0010 0.0031
## 60 3.3627 nan 0.0010 0.0023
## 80 3.3124 nan 0.0010 0.0022
## 100 3.2607 nan 0.0010 0.0020
gbmPred <- predict(gbmModel,x_test_chem)
## Using 100 trees...
cubistChemTree <- cubist(x_train_chem, y_train_chem,committees=50)
cubistPred <- predict(cubistChemTree,x_test_chem)
pred_chem_eval <- rbind(postResample(rpartPred, y_test_chem),
                        postResample(baggedPred, y_test_chem),
                        postResample(rfPred, y_test_chem),
                        postResample(gbmPred, y_test_chem),
                        postResample(cubistPred, y_test_chem))
rownames(pred_chem_eval) <- c('Single Tree: CART','Bagged Tree','Random Forest','Gradient Boosting','Cubist')
knitr::kable(pred_chem_eval)
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| Single Tree: CART | 1.4255108 | 0.3928835 | 1.1118891 |
| Bagged Tree | 1.1128390 | 0.5970195 | 0.8512993 |
| Random Forest | 0.9288144 | 0.7473898 | 0.7048974 |
| Gradient Boosting | 1.6866519 | 0.3386843 | 1.3921380 |
| Cubist | 0.8396574 | 0.7713848 | 0.6937094 |
The best model across the metrics evaluated (RMSE, \(R^2\), and MAE) is the Cubist algorithm.
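One caveat on the gradient boosting row: gbm.fit() ran with its defaults (the StepSize column above shows a 0.001 learning rate over only 100 trees), so that model is essentially undertrained and the comparison flatters the others. A fairer, hedged sketch would tune it through caret (gbmTuned is a hypothetical name and the grid values are illustrative, not exhaustive):
# Tune the boosted tree over a small illustrative grid before comparing.
set.seed(45)
gbmTuned <- train(x_train_chem, y_train_chem,
                  method = "gbm",
                  tuneGrid = expand.grid(n.trees = c(500, 1000),
                                         interaction.depth = c(1, 3, 5),
                                         shrinkage = c(0.01, 0.1),
                                         n.minobsinnode = 10),
                  trControl = trainControl(method = "cv"),
                  verbose = FALSE)
postResample(predict(gbmTuned, x_test_chem), y_test_chem)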
cubist_imp <- varImp(cubistChemTree,scale=TRUE)
as.data.frame(cubist_imp) |> arrange(-Overall) |> head(10)
## Overall
## ManufacturingProcess32 45.5
## ManufacturingProcess17 23.0
## ManufacturingProcess13 21.5
## BiologicalMaterial03 20.5
## ManufacturingProcess09 18.0
## BiologicalMaterial06 15.5
## BiologicalMaterial02 12.0
## ManufacturingProcess39 10.5
## ManufacturingProcess26 10.5
## ManufacturingProcess33 10.5
The manufacturing process variables dominate the important predictors in the tree-based model, which is consistent with the linear and non-linear models fit to this data set earlier; those models produced very similar importance rankings. The top predictor is the same across all of the models, although ManufacturingProcess17 ranks higher here than in the other two model types, where it was roughly fifth. The tree-based model also rates ManufacturingProcess09 as more significant than the other models do and places ManufacturingProcess39, 26, and 33 in its top ten.
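A small sketch to back the claim that process predictors dominate, counting predictor types among the top ten Cubist importance scores (top10 is a hypothetical name):
# Count process vs biological predictors among the ten highest importance scores.
top10 <- as.data.frame(cubist_imp) |>
  tibble::rownames_to_column("Predictor") |>
  arrange(-Overall) |>
  head(10)
table(ifelse(grepl("^Manufacturing", top10$Predictor), "Process", "Biological"))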
set.seed(10)
rpartTree <- rpart(Yield ~ ., data = cbind(y_chem,x_chem_imp))
plot(as.party(rpartTree),gp=gpar(fontsize = 10))
The top split is consistent with the variable importance rankings, although a single tree is not necessarily representative of how a model would perform on unseen data. There are more splits on biological material predictors than expected, and some predictors appear in the tree that did not look important in the other, more robust models.