Do problems 8.1, 8.2, 8.3, and 8.7 in Kuhn and Johnson. Please submit the RPubs link along with the .Rmd file.

library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## 
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## 
## Attaching package: 'party'
## 
## The following object is masked from 'package:dplyr':
## 
##     where
library(partykit)
## Loading required package: libcoin
## 
## Attaching package: 'partykit'
## 
## The following objects are masked from 'package:party':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp
library(AppliedPredictiveModeling)
library(Cubist)
library(rpart)
library(gbm)
## Loaded gbm 2.1.8.1
library(ipred)

8.1. Recreate the simulated data from Exercise 7.2:

set.seed(200)

simulated <- mlbench.friedman1(200, sd = 1)
simulated <- cbind(simulated$x, simulated$y)
simulated <- as.data.frame(simulated)
colnames(simulated)[ncol(simulated)] <- "y"
  (a) Fit a random forest model to all of the predictors, then estimate the variable importance scores:
set.seed(123)
randfor_sim <- randomForest(y ~ ., 
                            data = simulated,
                            importance = TRUE,
                            ntree = 1000) 

varImp(randfor_sim, scale = FALSE)
##          Overall
## V1   8.864171202
## V2   6.657873300
## V3   0.841086266
## V4   7.693792783
## V5   2.127834542
## V6   0.151948926
## V7   0.098320233
## V8   0.001378668
## V9  -0.038870929
## V10  0.014593264

Did the random forest model significantly use the uninformative predictors (V6 – V10)?

No. The importance scores for V6–V10 are all near zero, far below those of the variables actually used to build the trees. V6 has the largest score of the group but is still negligible, so the random forest did not meaningfully use the uninformative predictors.

  (b) Now add an additional predictor that is highly correlated with one of the informative predictors. For example:
simulated$duplicate1 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate1, simulated$V1)
## [1] 0.9496502

The correlation between the duplicate predictor and V1 is near perfect at roughly 0.95, so duplicate1 carries most of the variation in the V1 column.

Fit another random forest model to these data. Did the importance score for V1 change? What happens when you add another predictor that is also highly correlated with V1?

randfor_simcorr <- randomForest(y ~., 
                                data = simulated, 
                                importance = TRUE, 
                                ntree = 1000)
varImp(randfor_simcorr, scale = FALSE)
##                Overall
## V1          5.84650034
## V2          5.96164114
## V3          0.73333384
## V4          6.62213280
## V5          2.13591969
## V6          0.09174444
## V7         -0.03849802
## V8         -0.10724130
## V9         -0.08747650
## V10        -0.06821814
## duplicate1  3.90639448

Yes, the importance scores change dramatically; as Strobl et al. suggest, adding correlated predictors can shift which variables the model treats as significant or insignificant. As expected, the scores of the previously most influential predictors are lessened, likely because the model now attributes part of their contribution to the new correlated column.

simulated$duplicate2 <- simulated$V1 + rnorm(200) * .1
cor(simulated$duplicate2, simulated$V1)
## [1] 0.9411269
randfor_simcorr2 <- randomForest(y ~ ., 
                                 data = simulated, 
                                 importance = TRUE,
                                 ntree = 1000)
varImp(randfor_simcorr2, scale = FALSE)
##                Overall
## V1          4.74077847
## V2          6.16895867
## V3          0.42471658
## V4          7.14574311
## V5          2.07339187
## V6          0.12708648
## V7          0.02039238
## V8         -0.12199162
## V9         -0.05595316
## V10         0.05450384
## duplicate1  2.55791753
## duplicate2  2.90071515

As expected, V1's importance decreases further as the model splits the credit across all three correlated variables. This pattern would continue as more highly correlated predictors are added to the model, as the sketch below suggests.
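As a quick hedged sketch (not among the original runs), a third noisy copy of V1 can be added and the forest refit; duplicate3 is a hypothetical column name, and since the later exercises subset to columns 1:11, the extra column does not affect them:

# Sketch: a third noisy copy of V1 should dilute its importance even further
simulated$duplicate3 <- simulated$V1 + rnorm(200) * .1

randfor_simcorr3 <- randomForest(y ~ .,
                                 data = simulated,
                                 importance = TRUE,
                                 ntree = 1000)

# Compare V1 against its three near-duplicates
varImp(randfor_simcorr3, scale = FALSE)[c("V1", "duplicate1", "duplicate2", "duplicate3"), , drop = FALSE]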

  (c) Use the cforest function in the party package to fit a random forest model using conditional inference trees. The party package function varimp can calculate predictor importance. The conditional argument of that function toggles between the traditional importance measure and the modified version described in Strobl et al. (2007). Do these importances show the same pattern as the traditional random forest model?
cf_rf <- cforest(y ~., data = simulated[,c(1:11)])

varimp(cf_rf)
##          V1          V2          V3          V4          V5          V6 
##  8.17055263  6.06418458  0.21387594  7.28763406  2.40274870 -0.02536830 
##          V7          V8          V9         V10 
##  0.05517648  0.05512989 -0.09218832 -0.16158081
varimp(cf_rf,conditional=TRUE)
##         V1         V2         V3         V4         V5         V6         V7 
##  6.4199339  5.1798554  0.1693474  6.1651422  1.3470355 -0.3379530 -0.1054344 
##         V8         V9        V10 
## -0.3929227 -0.3170242 -0.2274853

Both the conditional and traditional measures identify V1 as the most important predictor. The traditional measure more closely mirrors the randomForest results, whereas the conditional variant treats V1 and V4 as more comparable predictors and ranks V2 relatively higher. Overall, the two measures produce fairly similar importance rankings to the other random forest implementation.
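As a hedged follow-up sketch (results not shown, and potentially slow), the conditional measure can also be applied to the data including the correlated duplicates, where it should discount V1 and its copies relative to the traditional measure:

# Sketch: conditional importance with the correlated duplicate columns included
cf_rf_dup <- cforest(y ~ ., data = simulated)
varimp(cf_rf_dup, conditional = TRUE)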

  (d) Repeat this process with different tree models, such as boosted trees and Cubist. Does the same pattern occur?
gbmTr <- gbm(y ~ ., data = simulated[,c(1:11)], distribution = "gaussian")
summary(gbmTr)

##     var    rel.inf
## V4   V4 29.6208855
## V1   V1 26.2629306
## V2   V2 23.6945298
## V5   V5 12.0208998
## V3   V3  7.9640377
## V6   V6  0.2451056
## V7   V7  0.1916109
## V8   V8  0.0000000
## V9   V9  0.0000000
## V10 V10  0.0000000

The gradient boosted model prioritizes V4 over V1 and V2, which differs slightly from the random forest rankings, and V2 appears more comparable to V1 as a predictor. Still, this model also concentrates its importance on V1–V5 and drives V6–V10 to essentially zero.

set.seed(85)
cubistTr <- train(y ~ .,
                     data = simulated[, c(1:11)],
                     method = "cubist")
varImp(cubistTr$finalModel, scale = FALSE)
##     Overall
## V1     72.0
## V3     42.0
## V2     54.5
## V4     49.0
## V5     40.0
## V6     11.0
## V7      0.0
## V8      0.0
## V9      0.0
## V10     0.0

The Cubist model identifies V1 as the strongest predictor, considerably above the rest of the variables, while V4 is more in line with the other significant predictors. This model follows the pattern shown by the others: V6–V10 are not significant.

[Fig. 8.24: A comparison of variable importance magnitudes for differing values of the bagging fraction and shrinkage parameters. Both tuning parameters are set to 0.1 in the left figure. Both are set to 0.9 in the right figure.]

8.2. Use a simulation to show tree bias with different granularities.

The textbook states the following in regard to granularities for a single regression tree:

“Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih 1997; Carolin et al. 2007; Loh 2010). Loh and Shih (1997) remarked that ‘The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all.’”

Based on that prior research, the predictors with more distinct values should appear higher in the tree than the more granular ones, even if they are less informative. The simulation below builds three uniform predictors of increasing cardinality that contribute equally to a noisy response:

set.seed(21)

# Three predictors with the same range but increasing numbers of distinct values
one <- sample(1:100 / 100, 200, replace = TRUE)       # 100 distinct values
two <- sample(1:750 / 750, 200, replace = TRUE)       # 750 distinct values
three <- sample(1:10000/ 10000, 200, replace = TRUE)  # 10,000 distinct values

# Each predictor contributes equally to the response; the noise term dominates
y_sim <- one + two + three + rnorm(200, mean = 0, sd = 5)

df_sim <- data.frame(one, two, three, y_sim)

rt_Sim <- rpart(y_sim ~ ., data = df_sim)

plot(as.party(rt_Sim))

varImp(rt_Sim)
##         Overall
## one   0.8019093
## three 0.6986667
## two   0.8775057

As expected, the top nodes all split on three, the column with the most distinct values. Interestingly, despite occupying the top of the tree, three remains the least influential of the three variables by importance, consistent with it being chosen for its many candidate split points rather than its information content. The split variables can be read directly off the fitted tree, as in the sketch below.
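A small sketch verifying the claim without re-reading the plot; rpart stores the split variable for each node in its frame, top of the tree first:

# Sketch: variables used at each split, in node order (leaves excluded)
rt_Sim$frame$var[rt_Sim$frame$var != "<leaf>"]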

8.3. In stochastic gradient boosting the bagging fraction and learning rate will govern the construction of the trees as they are guided by the gradient. Although the optimal values of these parameters should be obtained through the tuning process, it is helpful to understand how the magnitudes of these parameters affect magnitudes of variable importance. Figure 8.24 provides the variable importance plots for boosting using two extreme values for the bagging fraction (0.1 and 0.9) and the learning rate (0.1 and 0.9) for the solubility data. The left-hand plot has both parameters set to 0.1, and the right-hand plot has both set to 0.9:

  (a) Why does the model on the right focus its importance on just the first few predictors, whereas the model on the left spreads importance across more predictors?

With both parameters at 0.9, each tree takes a large step along the gradient (high learning rate) and is fit on nearly the same large sample of the training data (high bagging fraction). The first few trees therefore absorb most of the signal, subsequent trees see almost identical data and keep selecting the same dominant predictors, and the fit is more likely to settle quickly into a solution built around a few variables. With both parameters at 0.1, each tree contributes only a small correction and sees a different 10% sample, so importance is spread across many more predictors. A sketch of this comparison follows.
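A minimal sketch of the Fig. 8.24 comparison, assuming the solubility training objects (solTrainXtrans, solTrainY) from AppliedPredictiveModeling and a small n.trees for speed, so the exact numbers will differ from the book:

data(solubility)  # provides solTrainXtrans and solTrainY

# Both tuning parameters at 0.1: importance should spread across predictors
gbm_low <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                   n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                   verbose = FALSE)

# Both at 0.9: importance should concentrate on the first few predictors
gbm_high <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.9, bag.fraction = 0.9,
                    verbose = FALSE)

head(summary(gbm_low, plotit = FALSE), 10)
head(summary(gbm_high, plotit = FALSE), 10)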

  (b) Which model do you think would be more predictive of other samples?

The expectation is that the model on the left is more likely to predict other samples well: the small learning rate means it learns through many small corrections, and the small bagging fraction exposes it to many different cuts of the training data, both of which guard against overfitting, although this comes at a greater computational cost.

  (c) How would increasing interaction depth affect the slope of predictor importance for either model in Fig. 8.24?

Boosting uses shallow trees (weak learners) by default; increasing the interaction depth allows each tree to split on more predictors, so less dominant predictors would gain importance within the model. The slope of the importance profile would likely flatten for both models, as additional predictors pick up importance through inclusion in more of the trees. A sketch follows.
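Continuing the sketch above (same assumptions about the solubility objects), deeper trees should spread the relative influence over more predictors:

# Sketch: same low settings but deeper trees; expect a flatter importance slope
gbm_deep <- gbm.fit(solTrainXtrans, solTrainY, distribution = "gaussian",
                    n.trees = 100, shrinkage = 0.1, bag.fraction = 0.1,
                    interaction.depth = 10, verbose = FALSE)
head(summary(gbm_deep, plotit = FALSE), 10)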

8.7. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

summary(ChemicalManufacturingProcess)
##      Yield       BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##  Min.   :35.25   Min.   :4.580        Min.   :46.87        Min.   :56.97       
##  1st Qu.:38.75   1st Qu.:5.978        1st Qu.:52.68        1st Qu.:64.98       
##  Median :39.97   Median :6.305        Median :55.09        Median :67.22       
##  Mean   :40.18   Mean   :6.411        Mean   :55.69        Mean   :67.70       
##  3rd Qu.:41.48   3rd Qu.:6.870        3rd Qu.:58.74        3rd Qu.:70.43       
##  Max.   :46.34   Max.   :8.810        Max.   :64.75        Max.   :78.25       
##                                                                                
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##  Min.   : 9.38        Min.   :13.24        Min.   :40.60       
##  1st Qu.:11.24        1st Qu.:17.23        1st Qu.:46.05       
##  Median :12.10        Median :18.49        Median :48.46       
##  Mean   :12.35        Mean   :18.60        Mean   :48.91       
##  3rd Qu.:13.22        3rd Qu.:19.90        3rd Qu.:51.34       
##  Max.   :23.09        Max.   :24.85        Max.   :59.38       
##                                                                
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##  Min.   :100.0        Min.   :15.88        Min.   :11.44       
##  1st Qu.:100.0        1st Qu.:17.06        1st Qu.:12.60       
##  Median :100.0        Median :17.51        Median :12.84       
##  Mean   :100.0        Mean   :17.49        Mean   :12.85       
##  3rd Qu.:100.0        3rd Qu.:17.88        3rd Qu.:13.13       
##  Max.   :100.8        Max.   :19.14        Max.   :14.08       
##                                                                
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##  Min.   :1.770        Min.   :135.8        Min.   :18.35       
##  1st Qu.:2.460        1st Qu.:143.8        1st Qu.:19.73       
##  Median :2.710        Median :146.1        Median :20.12       
##  Mean   :2.801        Mean   :147.0        Mean   :20.20       
##  3rd Qu.:2.990        3rd Qu.:149.6        3rd Qu.:20.75       
##  Max.   :6.870        Max.   :158.7        Max.   :22.21       
##                                                                
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##  Min.   : 0.00          Min.   : 0.00          Min.   :1.47          
##  1st Qu.:10.80          1st Qu.:19.30          1st Qu.:1.53          
##  Median :11.40          Median :21.00          Median :1.54          
##  Mean   :11.21          Mean   :16.68          Mean   :1.54          
##  3rd Qu.:12.15          3rd Qu.:21.50          3rd Qu.:1.55          
##  Max.   :14.10          Max.   :22.50          Max.   :1.60          
##  NA's   :1              NA's   :3              NA's   :15            
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##  Min.   :911.0          Min.   : 923.0         Min.   :203.0         
##  1st Qu.:928.0          1st Qu.: 986.8         1st Qu.:205.7         
##  Median :934.0          Median : 999.2         Median :206.8         
##  Mean   :931.9          Mean   :1001.7         Mean   :207.4         
##  3rd Qu.:936.0          3rd Qu.:1008.9         3rd Qu.:208.7         
##  Max.   :946.0          Max.   :1175.3         Max.   :227.4         
##  NA's   :1              NA's   :1              NA's   :2             
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##  Min.   :177.0          Min.   :177.0          Min.   :38.89         
##  1st Qu.:177.0          1st Qu.:177.0          1st Qu.:44.89         
##  Median :177.0          Median :178.0          Median :45.73         
##  Mean   :177.5          Mean   :177.6          Mean   :45.66         
##  3rd Qu.:178.0          3rd Qu.:178.0          3rd Qu.:46.52         
##  Max.   :178.0          Max.   :178.0          Max.   :49.36         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##  Min.   : 7.500         Min.   : 7.500         Min.   :   0.0        
##  1st Qu.: 8.700         1st Qu.: 9.000         1st Qu.:   0.0        
##  Median : 9.100         Median : 9.400         Median :   0.0        
##  Mean   : 9.179         Mean   : 9.386         Mean   : 857.8        
##  3rd Qu.: 9.550         3rd Qu.: 9.900         3rd Qu.:   0.0        
##  Max.   :11.600         Max.   :11.500         Max.   :4549.0        
##  NA's   :9              NA's   :10             NA's   :1             
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##  Min.   :32.10          Min.   :4701           Min.   :5904          
##  1st Qu.:33.90          1st Qu.:4828           1st Qu.:6010          
##  Median :34.60          Median :4856           Median :6032          
##  Mean   :34.51          Mean   :4854           Mean   :6039          
##  3rd Qu.:35.20          3rd Qu.:4882           3rd Qu.:6061          
##  Max.   :38.60          Max.   :5055           Max.   :6233          
##                         NA's   :1                                    
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##  Min.   :   0           Min.   :31.30          Min.   :   0          
##  1st Qu.:4561           1st Qu.:33.50          1st Qu.:4813          
##  Median :4588           Median :34.40          Median :4835          
##  Mean   :4566           Mean   :34.34          Mean   :4810          
##  3rd Qu.:4619           3rd Qu.:35.10          3rd Qu.:4862          
##  Max.   :4852           Max.   :40.00          Max.   :4971          
##                                                                      
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##  Min.   :5890           Min.   :   0           Min.   :-1.8000       
##  1st Qu.:6001           1st Qu.:4553           1st Qu.:-0.6000       
##  Median :6022           Median :4582           Median :-0.3000       
##  Mean   :6028           Mean   :4556           Mean   :-0.1642       
##  3rd Qu.:6050           3rd Qu.:4610           3rd Qu.: 0.0000       
##  Max.   :6146           Max.   :4759           Max.   : 3.6000       
##                                                                      
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##  Min.   : 0.000         Min.   :0.000          Min.   : 0.000        
##  1st Qu.: 3.000         1st Qu.:2.000          1st Qu.: 4.000        
##  Median : 5.000         Median :3.000          Median : 8.000        
##  Mean   : 5.406         Mean   :3.017          Mean   : 8.834        
##  3rd Qu.: 8.000         3rd Qu.:4.000          3rd Qu.:14.000        
##  Max.   :12.000         Max.   :6.000          Max.   :23.000        
##  NA's   :1              NA's   :1              NA's   :1             
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##  Min.   :   0           Min.   :   0           Min.   :   0          
##  1st Qu.:4832           1st Qu.:6020           1st Qu.:4560          
##  Median :4855           Median :6047           Median :4587          
##  Mean   :4828           Mean   :6016           Mean   :4563          
##  3rd Qu.:4877           3rd Qu.:6070           3rd Qu.:4609          
##  Max.   :4990           Max.   :6161           Max.   :4710          
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##  Min.   : 0.000         Min.   : 0.00          Min.   : 0.000        
##  1st Qu.: 0.000         1st Qu.:19.70          1st Qu.: 8.800        
##  Median :10.400         Median :19.90          Median : 9.100        
##  Mean   : 6.592         Mean   :20.01          Mean   : 9.161        
##  3rd Qu.:10.750         3rd Qu.:20.40          3rd Qu.: 9.700        
##  Max.   :11.500         Max.   :22.00          Max.   :11.200        
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##  Min.   : 0.00          Min.   :143.0          Min.   :56.00         
##  1st Qu.:70.10          1st Qu.:155.0          1st Qu.:62.00         
##  Median :70.80          Median :158.0          Median :64.00         
##  Mean   :70.18          Mean   :158.5          Mean   :63.54         
##  3rd Qu.:71.40          3rd Qu.:162.0          3rd Qu.:65.00         
##  Max.   :72.50          Max.   :173.0          Max.   :70.00         
##  NA's   :5                                     NA's   :5             
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##  Min.   :2.300          Min.   :463.0          Min.   :0.01700       
##  1st Qu.:2.500          1st Qu.:490.0          1st Qu.:0.01900       
##  Median :2.500          Median :495.0          Median :0.02000       
##  Mean   :2.494          Mean   :495.6          Mean   :0.01957       
##  3rd Qu.:2.500          3rd Qu.:501.5          3rd Qu.:0.02000       
##  Max.   :2.600          Max.   :522.0          Max.   :0.02200       
##  NA's   :5              NA's   :5              NA's   :5             
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##  Min.   :0.000          Min.   :0.000          Min.   :0.000         
##  1st Qu.:0.700          1st Qu.:2.000          1st Qu.:7.100         
##  Median :1.000          Median :3.000          Median :7.200         
##  Mean   :1.014          Mean   :2.534          Mean   :6.851         
##  3rd Qu.:1.300          3rd Qu.:3.000          3rd Qu.:7.300         
##  Max.   :2.300          Max.   :3.000          Max.   :7.500         
##                                                                      
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##  Min.   :0.00000        Min.   :0.00000        Min.   : 0.00         
##  1st Qu.:0.00000        1st Qu.:0.00000        1st Qu.:11.40         
##  Median :0.00000        Median :0.00000        Median :11.60         
##  Mean   :0.01771        Mean   :0.02371        Mean   :11.21         
##  3rd Qu.:0.00000        3rd Qu.:0.00000        3rd Qu.:11.70         
##  Max.   :0.10000        Max.   :0.20000        Max.   :12.10         
##  NA's   :1              NA's   :1                                    
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##  Min.   : 0.0000        Min.   :0.000          Min.   :0.000         
##  1st Qu.: 0.6000        1st Qu.:1.800          1st Qu.:2.100         
##  Median : 0.8000        Median :1.900          Median :2.200         
##  Mean   : 0.9119        Mean   :1.805          Mean   :2.138         
##  3rd Qu.: 1.0250        3rd Qu.:1.900          3rd Qu.:2.300         
##  Max.   :11.0000        Max.   :2.100          Max.   :2.600         
## 
x_chem <- ChemicalManufacturingProcess |> select(-Yield)
y_chem <- ChemicalManufacturingProcess |> select(Yield)
  (a) A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
x_chem |>
  pivot_longer(cols = colnames(x_chem)) |>
  filter(is.na(value)) |>
  group_by(name) |>
  summarise(cnt = n()) |>
  arrange(-cnt)
## # A tibble: 28 × 2
##    name                     cnt
##    <chr>                  <int>
##  1 ManufacturingProcess03    15
##  2 ManufacturingProcess11    10
##  3 ManufacturingProcess10     9
##  4 ManufacturingProcess25     5
##  5 ManufacturingProcess26     5
##  6 ManufacturingProcess27     5
##  7 ManufacturingProcess28     5
##  8 ManufacturingProcess29     5
##  9 ManufacturingProcess30     5
## 10 ManufacturingProcess31     5
## # ℹ 18 more rows

Several imputation methods are available through preProcess; to preserve the overall shape of the data, K-nearest neighbors imputation will be used, together with centering and scaling, which the KNN distance calculations require.

Let’s apply KNN to find the nearest neighbors for imputation:

imputed <- preProcess(x_chem, method = c('center', 'scale', 'knnImpute'), k = 5)
x_chem_imp <- predict(imputed,x_chem)
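# Sanity check (added sketch): the imputed predictor set should contain no NAs
stopifnot(sum(is.na(x_chem_imp)) == 0)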
set.seed(21)
chem_train <- caret::createDataPartition(y_chem$Yield, p = .70, list= FALSE)
x_train_chem <- x_chem_imp[chem_train,]
x_test_chem <- x_chem_imp[-chem_train,]
y_train_chem <- y_chem[chem_train,]
y_test_chem <- y_chem[-chem_train,]
  (b) Which tree-based regression model gives the optimal resampling and test set performance?

CART

set.seed(45)

rpartTune <- train(x_train_chem, y_train_chem, 
                   method = "rpart2",
                   tuneLength = 10,
                   trControl = trainControl(method = "cv"))
## note: only 9 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 9 .
rpartPred <- predict(rpartTune,x_test_chem)

Bagged Tree

baggedTree <- ipredbagg(y_train_chem,x_train_chem)
baggedPred <- predict(baggedTree,x_test_chem)

Random Forest

rfTree <- randomForest(x_train_chem, y_train_chem)
rfPred <- predict(rfTree,x_test_chem)

Gradient Boosted

gbmModel <- gbm.fit(x_train_chem, y_train_chem, distribution = "gaussian")
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        3.5279             nan     0.0010    0.0024
##      2        3.5251             nan     0.0010    0.0024
##      3        3.5228             nan     0.0010    0.0030
##      4        3.5200             nan     0.0010    0.0028
##      5        3.5174             nan     0.0010    0.0025
##      6        3.5145             nan     0.0010    0.0027
##      7        3.5115             nan     0.0010    0.0025
##      8        3.5088             nan     0.0010    0.0024
##      9        3.5058             nan     0.0010    0.0023
##     10        3.5031             nan     0.0010    0.0028
##     20        3.4753             nan     0.0010    0.0027
##     40        3.4205             nan     0.0010    0.0031
##     60        3.3627             nan     0.0010    0.0023
##     80        3.3124             nan     0.0010    0.0022
##    100        3.2607             nan     0.0010    0.0020
gbmPred <- predict(gbmModel,x_test_chem)
## Using 100 trees...
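Note that the gbm.fit call above runs with its defaults (the trace shows 100 trees at a step size of 0.001), so the boosted model is barely trained, which likely explains its weak showing in the comparison below. As a hedged aside (not included in the comparison table), a cross-validated version via caret might look like:

set.seed(45)
gbmTuned <- train(x_train_chem, y_train_chem,
                  method = "gbm",
                  tuneLength = 5,
                  trControl = trainControl(method = "cv"),
                  verbose = FALSE)
gbmTunedPred <- predict(gbmTuned, x_test_chem)
postResample(gbmTunedPred, y_test_chem)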

Cubist

cubistChemTree <- cubist(x_train_chem, y_train_chem,committees=50)

cubistPred <- predict(cubistChemTree,x_test_chem)
pred_chem_eval <- rbind(postResample(rpartPred, y_test_chem),
                        postResample(baggedPred, y_test_chem),
                        postResample(rfPred, y_test_chem),
                        postResample(gbmPred, y_test_chem),
                        postResample(cubistPred, y_test_chem))

rownames(pred_chem_eval) <- c('Single Tree: CART','Bagged Tree','Random Forest','Gradient Boosting','Cubist')
knitr::kable(pred_chem_eval)
                    RMSE       Rsquared   MAE
Single Tree: CART   1.4255108  0.3928835  1.1118891
Bagged Tree         1.1128390  0.5970195  0.8512993
Random Forest       0.9288144  0.7473898  0.7048974
Gradient Boosting   1.6866519  0.3386843  1.3921380
Cubist              0.8396574  0.7713848  0.6937094

The best model across all of the metrics evaluated (RMSE, R², and MAE) is the Cubist algorithm. A resampled estimate for it is sketched below.
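Since the question also asks about resampling performance, here is a hedged sketch (not part of the original runs) estimating 10-fold cross-validated performance for a caret-tuned Cubist model; the tuning grid is caret's default and the exact numbers will vary:

set.seed(45)
cubistCV <- train(x_train_chem, y_train_chem,
                  method = "cubist",
                  trControl = trainControl(method = "cv"))
# Resampled metrics at the best tuning parameters
cubistCV$results[which.min(cubistCV$results$RMSE), c("RMSE", "Rsquared", "MAE")]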

  (c) Which predictors are most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
cubist_imp <- varImp(cubistChemTree,scale=TRUE)
as.data.frame(cubist_imp) |> arrange(-Overall) |> head(10)
##                        Overall
## ManufacturingProcess32    45.5
## ManufacturingProcess17    23.0
## ManufacturingProcess13    21.5
## BiologicalMaterial03      20.5
## ManufacturingProcess09    18.0
## BiologicalMaterial06      15.5
## BiologicalMaterial02      12.0
## ManufacturingProcess39    10.5
## ManufacturingProcess26    10.5
## ManufacturingProcess33    10.5

The manufacturing process variables dominate the importance list in the tree-based model, which is consistent with the other model families assessed for this data set. The linear and nonlinear models had very similar importance rankings to each other, and the top predictor was the same across all of the models. ManufacturingProcess17, however, ranked higher in the tree-based model than in the other two (where it was fifth), ManufacturingProcess09 was also more important here, and manufacturing processes 39, 26, and 33 appear in this top 10 but not the others'. A quick tally of predictor types follows.
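A short hedged sketch tallying biological vs. process predictors among the Cubist top 10, to back the claim that the process variables dominate (base-R ordering is used so the row names are preserved):

# Sketch: count predictor types in the top 10 by Cubist importance
top10 <- cubist_imp[order(-cubist_imp$Overall), , drop = FALSE][1:10, , drop = FALSE]
table(ifelse(grepl("^Biological", rownames(top10)), "Biological", "Process"))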

[Figure: SVM variable importance from the earlier nonlinear model, shown for comparison.]
  (d) Plot the optimal single tree with the distribution of yield in the terminal nodes. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
set.seed(10)
rpartTree <- rpart(Yield ~ ., data = cbind(y_chem,x_chem_imp))

plot(as.party(rpartTree),gp=gpar(fontsize = 10))

The top split is consistent with the variable importance results, although a single tree is not necessarily representative of how future unseen data would be handled. There are more splits on biological material predictors than expected, and some predictors appear in the tree that did not seem significant in the other, more robust models. The per-node yield means can also be extracted numerically, as sketched below.
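A final hedged sketch complementing the plot: assign each training run to its terminal node and average the yield within each node.

# Sketch: mean Yield within each terminal node of the single tree above
dat_chem <- cbind(y_chem, x_chem_imp)
nodes <- predict(as.party(rpartTree), newdata = dat_chem, type = "node")
tapply(dat_chem$Yield, nodes, mean)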