library(gt)
library(mlbench)
library(caret)
library(skimr)
library(AppliedPredictiveModeling)
library(rpart)
library(tidyverse)
library(tidymodels)
library(vip)
library(ggthemes)
library(randomForest)
library(gbm)
library(party)
library(Cubist)
  1. Refer to Exercises 6.3 and 7.5 which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several tree-based models:

Data From 6.3

Training and Test Split

Preprocessing From 6.3

Create Folds
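
The code for these steps is not shown in this report. A minimal sketch of how they might look with tidymodels follows. It assumes kNN imputation, a 75/25 split, near-zero-variance filtering, centering and scaling, and 10-fold cross-validation; the seed, split proportion, preprocessing steps, and object names are all assumptions, not the settings behind the results below.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

# Chemical manufacturing data: Yield plus biological and process predictors
data(ChemicalManufacturingProcess)

set.seed(123)  # assumed seed, for reproducibility only

# Training and test split (the 75/25 proportion is an assumption)
chem_split <- initial_split(ChemicalManufacturingProcess, prop = 0.75)
chem_train <- training(chem_split)
chem_test  <- testing(chem_split)

# Preprocessing: impute missing predictor values, drop near-zero-variance
# predictors, then center and scale (the exact steps are assumptions)
chem_recipe <- recipe(Yield ~ ., data = chem_train) %>%
  step_impute_knn(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_predictors())

# 10-fold cross-validation on the training set
chem_folds <- vfold_cv(chem_train, v = 10)
```

Defining the recipe on the training set only means the imputation and scaling parameters are re-estimated inside each fold during tuning, which keeps information from the held-out data out of the preprocessing.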

Part A

Which tree-based regression model gives the optimal resampling and test set performance?

Cubist

##    committees neighbors
## 27          6         3
.metric  .estimator  .estimate
rmse     standard    1.2645557
rsq      standard    0.5274334

Random Forest

## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config 
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   
## 1    57   223     6 rmse    standard    1.13    10  0.0972 Model012
## 2    57  1555     2 rmse    standard    1.14    10  0.0991 Model008
## 3    57  2000     2 rmse    standard    1.14    10  0.0956 Model010
## 4    57  1333     2 rmse    standard    1.14    10  0.0953 Model007
## 5    57  1777     2 rmse    standard    1.14    10  0.0959 Model009
.metric  .estimator  .estimate
rmse     standard    1.1850286
rsq      standard    0.6062937
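
Output in this shape typically comes from `show_best()` on a tuning result and `collect_metrics()` on a `last_fit()` object. A hedged sketch of that workflow follows. The grid size, seed, median imputation, fixed number of trees, and the `randomForest` engine are assumptions chosen to keep the example short, so it will not reproduce the numbers above.

```r
library(AppliedPredictiveModeling)
library(tidymodels)

data(ChemicalManufacturingProcess)

set.seed(123)  # assumed seed
chem_split <- initial_split(ChemicalManufacturingProcess, prop = 0.75)
chem_folds <- vfold_cv(training(chem_split), v = 10)

# Assumed preprocessing: median imputation only (trees need no scaling)
chem_recipe <- recipe(Yield ~ ., data = training(chem_split)) %>%
  step_impute_median(all_predictors())

# Random forest with mtry and min_n left open for tuning
rf_spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("randomForest") %>%
  set_mode("regression")

rf_workflow <- workflow() %>%
  add_recipe(chem_recipe) %>%
  add_model(rf_spec)

# Resampled RMSE for 5 candidate settings (grid size is an assumption)
rf_tuned <- tune_grid(rf_workflow, resamples = chem_folds, grid = 5,
                      metrics = metric_set(rmse))
show_best(rf_tuned, metric = "rmse", n = 5)

# Refit the best candidate on the training set and score the test set
rf_final <- rf_workflow %>%
  finalize_workflow(select_best(rf_tuned, metric = "rmse")) %>%
  last_fit(chem_split)
collect_metrics(rf_final)
```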

XGBoost

## # A tibble: 5 x 10
##    mtry trees learn_rate loss_reduction .metric .estimator  mean     n std_err
##   <int> <int>      <dbl>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1     7  1111        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 2     7  1333        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 3     7  1555        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 4     7  1777        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 5     7  2000        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## # ... with 1 more variable: .config <chr>
.metric  .estimator  .estimate
rmse     standard    1.278147
rsq      standard    0.521448

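The XGBoost model was the top performer in resampling, with a resampled RMSE of 1.03 versus 1.13 for random forest. It did not hold that lead on the test set, however: random forest had the best test performance (RMSE 1.19, R² 0.61), ahead of Cubist (RMSE 1.26, R² 0.53) and XGBoost (RMSE 1.28, R² 0.52).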
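
To make the comparison easier to check, the metrics printed above can be placed side by side. This is a small sketch using only the values reported in this section; the data frame and column names are hypothetical, and the Cubist resampled RMSE is left as `NA` because it is not shown in the output.

```r
# Metrics copied from the output above; Cubist resampled RMSE is not reported
results <- data.frame(
  model     = c("Cubist", "Random Forest", "XGBoost"),
  cv_rmse   = c(NA, 1.13, 1.03),
  test_rmse = c(1.2645557, 1.1850286, 1.278147),
  test_rsq  = c(0.5274334, 0.6062937, 0.521448)
)

# Difference between test and resampled RMSE; a large positive gap
# suggests the resampling estimate was optimistic
results$gap <- results$test_rmse - results$cv_rmse

# Rank the models by test set RMSE (lower is better)
results[order(results$test_rmse), ]

best_cv   <- results$model[which.min(results$cv_rmse)]
best_test <- results$model[which.min(results$test_rmse)]
```

Here `which.min()` skips the missing Cubist value, so `best_cv` only compares the two models whose resampled RMSE is reported.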

Part B

Which predictors are the most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?

As in the earlier exercise, ManufacturingProcess32 was the most important predictor. The top 10 variables overlap with the earlier ones but are not exactly the same.
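
One way to check whether biological or process variables dominate is to extract the importance scores and count the top 10 by variable type. The sketch below does this with `vip`. The median imputation, seed, boosting settings, and fitting on the full data set are assumptions made for brevity, not the tuned model from Part A, so the ranking it produces may differ.

```r
library(AppliedPredictiveModeling)
library(tidymodels)
library(vip)

data(ChemicalManufacturingProcess)

# Assumed preprocessing: median imputation on the full data set
chem <- recipe(Yield ~ ., data = ChemicalManufacturingProcess) %>%
  step_impute_median(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

set.seed(123)  # assumed seed

# Boosted trees with illustrative (untuned) settings
xgb_fit <- boost_tree(trees = 200, learn_rate = 0.1) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  fit(Yield ~ ., data = chem)

# Importance scores from the underlying xgboost model, top 10 only,
# labeled by whether the predictor is biological or process
top10 <- vi(extract_fit_engine(xgb_fit)) %>%
  slice_max(Importance, n = 10, with_ties = FALSE) %>%
  mutate(type = if_else(startsWith(Variable, "Biological"),
                        "Biological", "Process"))

count(top10, type)

# Bar chart of the same top 10
vip(extract_fit_engine(xgb_fit), num_features = 10)
```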

Part C

Plot the optimal single tree with the distribution of yield in the terminals. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?

The plot above shows results similar to the XGBoost model: ManufacturingProcess32 was the top predictor in both. BiologicalMaterial12 and BiologicalMaterial06 were also prominent in both the XGBoost model and the single tree.
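
A plot of this kind can be produced with a conditional inference tree from the `party` package, whose default plot draws a boxplot of the response in each terminal node. This is a sketch under assumptions: the report does not say which single-tree method was used, and the median imputation and default controls here are illustrative.

```r
library(AppliedPredictiveModeling)
library(party)

data(ChemicalManufacturingProcess)

# Assumed preprocessing: simple median imputation of missing predictor values
chem <- ChemicalManufacturingProcess
for (col in setdiff(names(chem), "Yield")) {
  missing <- is.na(chem[[col]])
  chem[[col]][missing] <- median(chem[[col]], na.rm = TRUE)
}

# Conditional inference tree with default controls (an assumption)
tree_fit <- ctree(Yield ~ ., data = chem)

# For a numeric response, the default plot shows the distribution of
# Yield as a boxplot in each terminal node
plot(tree_fit)
```

Reading the split variables from the root downward gives a rough importance ordering that can be compared with the boosted model's top 10.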