Refer to Exercises 6.3 and 7.5, which describe a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before, and train several tree-based models:
Part A
Which tree-based regression model gives the optimal resampling and test set performance?
Cubist
```r
library(caret)

# Tune over committees 1-10 and neighbors 1, 3, 5, 7, 9 with 10-fold CV
cube_param <- expand.grid(committees = seq(1, 10, by = 1),
                          neighbors = seq(1, 9, by = 2))
cube_ctrl <- trainControl(method = "cv", number = 10)

# Separate the response from the predictors
d1 <- chem_train %>% select(Yield)
d2 <- chem_train %>% select(-Yield)

m_cube <- train(x = d2, y = d1$Yield, method = "cubist",
                trControl = cube_ctrl,
                tuneGrid = cube_param,
                verbose = FALSE)
```
```r
m_cube$bestTune
```

```
##    committees neighbors
## 27          6         3
```
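The test set metrics below look like yardstick output; a minimal sketch of how they might have been computed, assuming a held-out `chem_test` data frame from the earlier splitting step:

```r
library(yardstick)

# Predict on the test set with the tuned Cubist model and score with
# yardstick; chem_test is assumed from the earlier data-splitting step
cube_res <- tibble(
  Yield = chem_test$Yield,
  .pred = predict(m_cube, newdata = select(chem_test, -Yield))
)
metric_set(rmse, rsq)(cube_res, truth = Yield, estimate = .pred)
```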
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.2645557 |
| rsq     | standard   | 0.5274334 |
Random Forest
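The random forest tuning code is not shown; a sketch that would produce a summary like the one below, assuming the same `chem_rec` recipe and `folds` resamples used in the XGBoost workflow (the grid resolution is a guess):

```r
# Tune mtry, trees, and min_n on a regular grid; mtry's upper bound is
# finalized from the number of predictors in the training data
rf_spec <- rand_forest(mode = "regression",
                       mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger")
rf_grid <- grid_regular(finalize(mtry(), select(chem_train, -Yield)),
                        trees(), min_n(), levels = 10)
rf_wf <- workflow() %>% add_model(rf_spec) %>% add_recipe(chem_rec)
rf_fit <- rf_wf %>% tune_grid(resamples = folds, grid = rf_grid)
rf_fit %>% show_best(metric = "rmse")
```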
```
## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
## 1    57   223     6 rmse    standard    1.13    10  0.0972 Model012
## 2    57  1555     2 rmse    standard    1.14    10  0.0991 Model008
## 3    57  2000     2 rmse    standard    1.14    10  0.0956 Model010
## 4    57  1333     2 rmse    standard    1.14    10  0.0953 Model007
## 5    57  1777     2 rmse    standard    1.14    10  0.0959 Model009
```
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.1850286 |
| rsq     | standard   | 0.6062937 |
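The test set numbers above are consistent with `last_fit()` on the finalized workflow; a sketch, assuming `chem_split` is the initial train/test split object:

```r
# Refit the best resampled configuration on the full training set and
# score it once on the held-out test set
best_rf <- select_best(rf_fit, metric = "rmse")
rf_final <- finalize_workflow(rf_wf, best_rf) %>% last_fit(chem_split)
collect_metrics(rf_final)
```

The same pattern would produce the XGBoost test metrics further down.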
XGBoost
```r
# Boosted trees: tune mtry, trees, learn_rate, and loss_reduction while
# fixing min_n, tree_depth, the subsample rate, and early stopping
xgb <- boost_tree(mode = "regression",
                  mtry = tune(), trees = tune(),
                  min_n = 10, tree_depth = 8,
                  learn_rate = tune(),
                  loss_reduction = tune(),
                  sample_size = .60,
                  stop_iter = 3) %>%
  set_engine("xgboost")

# Regular grid; mtry's upper bound is finalized from the predictor count
xgb_tune <- grid_regular(finalize(mtry(), select(chem_train, -Yield)),
                         trees(),
                         learn_rate(), loss_reduction(),
                         levels = 10)

xgb_wf <- workflow() %>% add_model(xgb) %>% add_recipe(chem_rec)
xgb_fit <- xgb_wf %>% tune_grid(resamples = folds, grid = xgb_tune)
```
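The ranking below is the sort of output `show_best()` prints for the fitted grid (the display call itself is not shown, so this is an assumption):

```r
xgb_fit %>% show_best(metric = "rmse")
```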
```
## # A tibble: 5 x 10
##    mtry trees learn_rate loss_reduction .metric .estimator  mean     n std_err
##   <int> <int>      <dbl>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
## 1     7  1111        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 2     7  1333        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 3     7  1555        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 4     7  1777        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## 5     7  2000        0.1    0.000000681 rmse    standard    1.03    10  0.0606
## # ... with 1 more variable: .config <chr>
```
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| rmse    | standard   | 1.278147  |
| rsq     | standard   | 0.521448  |
Based on resampling, the XGBoost model was the top performer, with a mean CV RMSE of about 1.03 versus 1.13 for the random forest, although the random forest posted the lower test set RMSE (1.185 versus 1.278).
Part B
Which predictors are the most important in the optimal tree-based regression model? Do either the biological or process variables dominate the list? How do the top 10 important predictors compare to the top 10 predictors from the optimal linear and nonlinear models?
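One way to extract the importance ranking is the `vip` package applied to the finalized XGBoost fit; a sketch, where `xgb_final` is an assumed `last_fit()` result for the tuned workflow:

```r
library(vip)

# Pull the underlying xgboost fit out of the finalized workflow and
# plot the ten most important predictors
xgb_final %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)
```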

Similar to the earlier exercises, ManufacturingProcess32 was the most important predictor. The top 10 predictors overlap with the top 10 from the optimal linear and nonlinear models, but the lists are not identical.
Part C
Plot the optimal single tree with the distribution of yield in the terminals. Does this view of the data provide additional knowledge about the biological or process predictors and their relationship with yield?
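A sketch of one way to produce such a plot, using `rpart` for the single tree and `partykit` for the terminal-node distributions (the original plotting code is not shown):

```r
library(rpart)
library(partykit)

# Fit a single regression tree on the training data; plotting the party
# representation draws a boxplot of Yield in each terminal node
m_tree <- rpart(Yield ~ ., data = chem_train)
plot(as.party(m_tree))
```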

The single-tree plot shows results similar to the XGBoost model: ManufacturingProcess32 was at the top of both. Additionally, BiologicalMaterial12 and BiologicalMaterial06 were prominent in both the XGBoost model and the single tree, so both process and biological predictors show clear relationships with yield.