Exercises 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data: \[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \] where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

1. Load Packages and Simulate Data

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
library(mlbench)
library(e1071)
library(nnet)

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
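For intuition, the generating equation can be written directly in R. The sketch below mirrors what mlbench.friedman1 computes (the real function also appends five uninformative uniform predictors, x6 through x10; friedman1_response is a hypothetical name):

# A direct transcription of the Friedman (1991) response surface
friedman1_response <- function(x, sd = 1) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(nrow(x), sd = sd)
}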

2. Train and Evaluate Multiple Models

Model 1: k-Nearest Neighbors (kNN)

set.seed(200)
# No trControl is supplied, so caret defaults to 25 bootstrap resamples
knnModel <- train(
  x = trainingData$x, y = trainingData$y,
  method = "knn",
  preProcess = c("center", "scale"),
  tuneLength = 10
)

# Predict and evaluate on test set
knnPred <- predict(knnModel, testData$x)
knnPerf <- postResample(pred = knnPred, obs = testData$y)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.654912  0.4779838  2.958475
##    7  3.529432  0.5118581  2.861742
##    9  3.446330  0.5425096  2.780756
##   11  3.378049  0.5723793  2.719410
##   13  3.332339  0.5953773  2.692863
##   15  3.309235  0.6111389  2.663046
##   17  3.317408  0.6201421  2.678898
##   19  3.311667  0.6333800  2.682098
##   21  3.316340  0.6407537  2.688887
##   23  3.326040  0.6491480  2.705915
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPerf
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169
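As a reference point, postResample computes RMSE, R-squared (as the squared correlation between predictions and observations), and MAE; the same numbers can be reproduced by hand:

# Manual equivalents of the postResample metrics
sqrt(mean((knnPred - testData$y)^2))  # RMSE
cor(knnPred, testData$y)^2            # Rsquared
mean(abs(knnPred - testData$y))       # MAE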

Model 2: Multivariate Adaptive Regression Splines (MARS)

# Fit MARS with earth's defaults (degree = 1, i.e., an additive model)
marsModel <- earth(trainingData$x, trainingData$y)

marsPred <- predict(marsModel, testData$x)
marsPerf <- postResample(pred = marsPred, obs = testData$y)
summary(marsModel)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
marsPerf
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836
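Each h(.) term in the summary is a hinge function, h(u) = max(0, u), so the fitted model is a sum of piecewise-linear basis functions joined at data-driven knots. For example, the selected term h(X3 - 0.447442) is zero below the knot at 0.447442 and grows linearly above it:

# The hinge function underlying every MARS basis term
h <- function(u) pmax(0, u)
# e.g., one selected term's contribution to the prediction:
# -23.282007 * h(X3 - 0.447442)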

Model 3: Support Vector Machine (RBF kernel)

set.seed(200)
svmModel <- train(
  x = trainingData$x, y = trainingData$y,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 14,
  trControl = trainControl(method = "cv")
)

svmPred <- predict(svmModel, testData$x)
svmPerf <- postResample(pred = svmPred, obs = testData$y)
svmModel
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.525164  0.7810576  2.010680
##      0.50  2.270567  0.7944850  1.794902
##      1.00  2.099356  0.8155574  1.659376
##      2.00  2.005858  0.8302852  1.578799
##      4.00  1.934650  0.8435677  1.528373
##      8.00  1.915665  0.8475605  1.528648
##     16.00  1.923914  0.8463074  1.535991
##     32.00  1.923914  0.8463074  1.535991
##     64.00  1.923914  0.8463074  1.535991
##    128.00  1.923914  0.8463074  1.535991
##    256.00  1.923914  0.8463074  1.535991
##    512.00  1.923914  0.8463074  1.535991
##   1024.00  1.923914  0.8463074  1.535991
##   2048.00  1.923914  0.8463074  1.535991
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 8.
svmPerf
##      RMSE  Rsquared       MAE 
## 2.0541197 0.8290353 1.5586411
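Note that sigma was held constant rather than tuned: for svmRadial, caret estimates the kernel width once from the training predictors using kernlab's sigest heuristic. Roughly (a sketch of caret's internals, not code from this analysis):

# How the fixed sigma is obtained
library(kernlab)
sigmas <- sigest(as.matrix(trainingData$x), na.action = na.omit, scaled = TRUE)
# train() then uses a value near the middle of the returned range
# (approximately the mean of its endpoints) as the constant sigma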

Model 4: Bayesian Regularized Neural Network (brnn)

# install.packages("brnn")
library(brnn)
## Loading required package: truncnorm
set.seed(200)
brnnModel <- train(
  x = trainingData$x, y = trainingData$y,
  method = "brnn",
  tuneLength = 2,
  trControl = trainControl(method = "cv")
)
## (brnn prints Nguyen-Widrow initialization details for every fit: the
## parameter count, scaling factor, and estimated gamma/alpha/beta values,
## repeated for each of the 20 resampling fits and the final model;
## this verbose output is truncated here)
brnnPred <- predict(brnnModel, testData$x)
brnnPerf <- postResample(pred = brnnPred, obs = testData$y)
brnnModel
## Bayesian Regularized Neural Networks 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   neurons  RMSE      Rsquared   MAE     
##   1        2.407648  0.7593635  1.936012
##   2        1.959947  0.8290583  1.559422
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was neurons = 2.
brnnPerf
##     RMSE Rsquared      MAE 
## 2.293683 0.790615 1.699080
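The gamma, alpha, and beta values printed during training come from Bayesian regularization in the MacKay/Foresee-Hagan style: the network minimizes \[ F = \beta E_D + \alpha E_W \] where E_D is the sum of squared residuals, E_W is the sum of squared weights, and gamma estimates the effective number of parameters. The penalty on weight magnitude is what lets a small network generalize without a separate validation set.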

Model 5: Random Forest

set.seed(200)
rfModel <- train(
  x = trainingData$x, y = trainingData$y,
  method = "rf",
  tuneLength = 5,
  trControl = trainControl(method = "cv")
)

rfPred <- predict(rfModel, testData$x)
rfPerf <- postResample(pred = rfPred, obs = testData$y)
rfModel
## Random Forest 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    2.824419  0.8150845  2.332915
##    4    2.473928  0.8167031  2.044099
##    6    2.409182  0.8032251  2.001063
##    8    2.386901  0.7978693  1.965533
##   10    2.389432  0.7908445  1.953503
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 8.
rfPerf
##      RMSE  Rsquared       MAE 
## 2.4218795 0.7900169 1.9134177
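The discussion later speculates about which variables the forest relied on; this can be checked directly, since varImp reports the forest's importance scores (impurity-based here, because importance = TRUE was not set when training):

# Quick check that the informative predictors X1-X5 dominate
varImp(rfModel)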

3. Compare Model Performances

# Combine results into a data frame
results <- data.frame(
  Model = c("kNN", "MARS", "SVM (RBF)", "BRNN", "Random Forest"),
  RMSE = c(knnPerf["RMSE"], marsPerf["RMSE"], svmPerf["RMSE"], brnnPerf["RMSE"], rfPerf["RMSE"]),
  Rsquared = c(knnPerf["Rsquared"], marsPerf["Rsquared"], svmPerf["Rsquared"], brnnPerf["Rsquared"], rfPerf["Rsquared"]),
  MAE = c(knnPerf["MAE"], marsPerf["MAE"], svmPerf["MAE"], brnnPerf["MAE"], rfPerf["MAE"])
)

# View performance table
print(results)
##           Model     RMSE  Rsquared      MAE
## 1           kNN 3.175066 0.6785946 2.544317
## 2          MARS 1.813647 0.8677298 1.391184
## 3     SVM (RBF) 2.054120 0.8290353 1.558641
## 4          BRNN 2.293683 0.7906150 1.699080
## 5 Random Forest 2.421879 0.7900169 1.913418
# Optional: visualize performance
library(ggplot2)

results_long <- reshape2::melt(results, id.vars = "Model")

ggplot(results_long, aes(x = reorder(Model, value), y = value, fill = variable)) +
  geom_col(position = "dodge") +
  facet_wrap(~ variable, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Model Performance Comparison",
    x = "Model", y = "Metric Value"
  )
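Test-set metrics are a single draw, so it is also worth comparing the resampling distributions with caret's resamples(). A sketch for the three models that share the same 10-fold CV splits (each was trained immediately after set.seed(200); kNN used bootstrap resampling and the MARS fit came from earth() directly, so both are excluded):

# Compare cross-validation distributions across models
res <- resamples(list(SVM = svmModel, BRNN = brnnModel, RF = rfModel))
summary(res)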

### Model-by-Model Analysis

1. k-Nearest Neighbors (kNN):

The kNN model, while simple and intuitive, had the weakest performance among all the models tested. It produced the highest RMSE (3.18) and MAE (2.54), and the lowest R² value (0.68), indicating that it struggled to capture the nonlinear structure of the data. This result is expected because kNN is highly sensitive to irrelevant or noisy features, and in the Friedman dataset, only the first five out of ten predictors are informative. Without a strong mechanism for feature selection or weighting, kNN treats all features equally, which likely diluted its predictive power. Additionally, kNN tends to perform poorly in high-dimensional spaces unless feature engineering or dimensionality reduction is applied beforehand.

2. Multivariate Adaptive Regression Splines (MARS):

MARS clearly outperformed all other models, achieving the lowest RMSE (1.81), lowest MAE (1.39), and highest R² (0.87). This demonstrates that it was best at approximating the complex, nonlinear relationships in the data. MARS works by fitting piecewise linear regressions and automatically selecting both relevant predictors and interaction terms, which aligns perfectly with the known structure of the Friedman function. Since only the first five variables truly contribute to the outcome, MARS’s embedded feature selection allowed it to ignore the irrelevant ones and focus its capacity on modeling the important relationships. Moreover, the model’s interpretability and its ability to visualize variable importance make it a strong candidate for both performance and explainability.

3. Support Vector Machine with Radial Basis Function Kernel (SVM-RBF):

SVM with an RBF kernel also performed well, with an RMSE of 2.05, MAE of 1.56, and R² of 0.83. The RBF kernel makes the model adept at nonlinear relationships by implicitly mapping the inputs into a higher-dimensional space where a linear regression surface can be fit. While the SVM did not match MARS in accuracy, it still captured the underlying structure well. Its slight underperformance may be attributed to sensitivity to the tuning parameters, the cost C and the kernel width sigma, or to its lack of built-in feature selection, which leaves it exposed to the five uninformative predictors. Nevertheless, the SVM remained a strong contender in overall performance and generalization to unseen data.

4. Bayesian Regularized Neural Network (BRNN):

The BRNN model showed moderately strong results, with an RMSE of 2.29, MAE of 1.70, and R² of 0.79. Bayesian regularization helps prevent overfitting by controlling the model complexity through prior distributions on the weights. Although this helped the neural net generalize better than traditional backpropagation networks, its performance was still somewhat behind MARS and SVM. This may be due to the relatively small training set (n = 200), which can limit the potential of neural networks to learn complex functions unless paired with substantial regularization and tuning. Despite this, BRNN still demonstrated the ability to approximate nonlinear patterns better than kNN or Random Forest in this specific context.

5. Random Forest:

Random Forest, surprisingly, did not perform as well as expected. It had an RMSE of 2.42, an MAE of 1.91, and an R² of 0.79, close to BRNN but noticeably worse than MARS and SVM. Random forests are generally robust and handle noisy and irrelevant features better than many models. However, they can struggle when the signal is smooth and additive, as it is here: ensembles of axis-aligned splits approximate smooth surfaces like sin(pi x1 x2) only coarsely. It is also possible that the model was under-tuned or that individual trees split on less informative variables, hurting generalization. While random forests are often a safe and strong baseline, in this case they did not capture the Friedman function as effectively as the other models.

### Conclusion

Based on the model comparison, Multivariate Adaptive Regression Splines (MARS) is the clear winner for this task. It not only delivered the most accurate predictions but also did so with a high level of model interpretability. MARS excelled because it aligns well with the structure of the Friedman dataset, where only a subset of features drive the response and the relationships are nonlinear and additive. SVM with an RBF kernel also performed admirably, showing strong generalization and relatively low error, making it a good alternative in scenarios where interpretability is less critical.

On the other hand, kNN’s poor performance highlights the importance of accounting for feature relevance and dataset structure when choosing a model. Its equal treatment of all variables caused it to be misled by irrelevant features. BRNN and Random Forests fell into the middle tier of performance, both showing adequate modeling ability but falling short of the leading contenders, possibly due to insufficient tuning, data size, or sensitivity to feature noise.

Exercises 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

1. Which nonlinear regression model gives the optimal resampling and test set performance?

# Load packages
library(AppliedPredictiveModeling)
library(caret)
library(earth)
library(e1071)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(nnet)

# Load dataset
data(ChemicalManufacturingProcess)
df <- ChemicalManufacturingProcess

# Impute missing values with each column's median
# (note: imputing before the split lets test rows influence the medians
# slightly; done here to mirror the pre-processing from Exercise 6.3)
for (col in names(df)) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
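An equivalent route uses caret's preProcess with the medianImpute method (a sketch; it learns the same per-column medians):

# Alternative: caret-based median imputation
pp <- preProcess(df, method = "medianImpute")
df_imputed <- predict(pp, df)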

# Split data into train (75%) and test (25%)
set.seed(123)
split <- createDataPartition(df$Yield, p = 0.75, list = FALSE)
train <- df[split, ]
test <- df[-split, ]

x_train <- train[, -1]
y_train <- train$Yield
x_test <- test[, -1]
y_test <- test$Yield

# Set up training control
ctrl <- trainControl(method = "cv", number = 10)

# Model 1: k-Nearest Neighbors
set.seed(123)
knn_model <- train(x_train, y_train, method = "knn",
                   preProcess = c("center", "scale"),
                   tuneLength = 10, trControl = ctrl)
knn_pred <- predict(knn_model, x_test)
knn_perf <- postResample(knn_pred, y_test)

# Model 2: MARS
mars_model <- earth(x_train, y_train)
mars_pred <- predict(mars_model, x_test)
mars_perf <- postResample(mars_pred, y_test)

# Model 3: SVM (Radial)
set.seed(123)
svm_model <- train(x_train, y_train, method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 10, trControl = ctrl)
svm_pred <- predict(svm_model, x_test)
svm_perf <- postResample(svm_pred, y_test)

# Model 4: Random Forest
set.seed(123)
rf_model <- train(x_train, y_train, method = "rf",
                  importance = TRUE, tuneLength = 5, trControl = ctrl)
rf_pred <- predict(rf_model, x_test)
rf_perf <- postResample(rf_pred, y_test)

# Model 5: Gradient Boosting Machine
set.seed(123)
gbm_model <- train(x_train, y_train, method = "gbm",
                   verbose = FALSE, tuneLength = 5, trControl = ctrl)
gbm_pred <- predict(gbm_model, x_test)
gbm_perf <- postResample(gbm_pred, y_test)

# Compare model performance
model_results <- rbind(
  KNN = knn_perf,
  MARS = mars_perf,
  SVM = svm_perf,
  RandomForest = rf_perf,
  GBM = gbm_perf
)

print(model_results)
##                  RMSE  Rsquared       MAE
## KNN          1.370642 0.3934243 1.0947652
## MARS         1.221511 0.5774516 0.9370294
## SVM          1.121974 0.5882149 0.8789167
## RandomForest 1.210482 0.5215893 0.8796533
## GBM          1.145438 0.5758184 0.8532827

The optimal model based on the test set performance is the one with the lowest RMSE and highest R². In this case:

* SVM has the lowest RMSE (1.121974) and the highest R² (0.5882149).
* GBM has a slightly higher RMSE but the lowest MAE (0.8532827).

2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

SVM (radial) is therefore the optimal nonlinear model, based on the lowest RMSE and highest R². However, SVMs provide no model-specific variable importance measure, so we instead extract variable importance from the second-best nonlinear model (GBM), which provides it directly.

The code to extract and plot variable importance from the GBM model:

# Variable importance from GBM model
gbm_imp <- varImp(gbm_model, scale = FALSE)

# Plot top 20 important predictors
plot(gbm_imp, top = 20, main = "Top 20 Important Predictors (GBM)")
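If importance tied to the SVM fit itself is wanted, varImp on a train object without a model-specific measure falls back to a model-free filter (for regression, we believe this is the R² of a loess fit of the outcome against each predictor); a sketch:

# Filter-based importance for the SVM model (model-free fallback)
svm_imp <- varImp(svm_model, scale = FALSE)
plot(svm_imp, top = 20, main = "Top 20 Predictors (SVM, filter-based)")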

Now classify the top predictors as biological or process:

# Identify top 10 important predictors
top10_vars <- rownames(gbm_imp$importance)[order(gbm_imp$importance$Overall, decreasing = TRUE)][1:10]

# Count how many are biological vs process variables
bio_count <- sum(grepl("^BiologicalMaterial", top10_vars))
proc_count <- sum(grepl("^ManufacturingProcess", top10_vars))

cat("Biological predictors in top 10:", bio_count, "\n")
## Biological predictors in top 10: 4
cat("Process predictors in top 10:", proc_count, "\n")
## Process predictors in top 10: 6

The most important predictors identified by the GBM model include both biological material and manufacturing process variables. Among the top 10 predictors there were 4 biological and 6 process variables, suggesting that both raw material quality and adjustable process conditions significantly influence the yield.

The optimal nonlinear regression model based on the test set performance was the Support Vector Machine (SVM) with a radial kernel. However, since SVM does not provide direct variable importance measures, we examined the importance from the Gradient Boosting Machine (GBM) model, which was the second-best performing nonlinear model. Based on the output, among the top 10 most important predictors identified by the GBM model, there were 4 biological material variables and 6 manufacturing process variables. This suggests that both the quality of the raw materials and the conditions of the manufacturing process are influential in determining the product yield, with a slight dominance of process-related factors in the top 10 according to GBM.
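The question also asks how these compare with the optimal linear model's top ten. That model was fit in Exercise 6.3 and is not refit above; assuming it was PLS (an assumption, since the earlier exercise is not shown here), a sketch of the comparison:

# Hypothetical comparison against the Exercise 6.3 linear model (PLS assumed)
set.seed(123)
pls_model <- train(x_train, y_train, method = "pls",
                   preProcess = c("center", "scale"),
                   tuneLength = 20, trControl = ctrl)
pls_imp <- varImp(pls_model, scale = FALSE)
top10_linear <- rownames(pls_imp$importance)[
  order(pls_imp$importance$Overall, decreasing = TRUE)][1:10]
intersect(top10_vars, top10_linear)  # predictors common to both top-ten lists

Typically the manufacturing process variables also dominate the linear model's list, but the actual overlap should be read off this output rather than assumed.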

3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Select top 10 variables from GBM importance
top10_vars <- rownames(gbm_imp$importance)[order(gbm_imp$importance$Overall, decreasing = TRUE)][1:10]

# Combine with Yield
top_data <- data.frame(Yield = df$Yield, df[, top10_vars])

# Reshape for ggplot
library(tidyr)
library(ggplot2)

top_data_long <- pivot_longer(top_data, -Yield, names_to = "Predictor", values_to = "Value")

# Create plots
ggplot(top_data_long, aes(x = Value, y = Yield)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  facet_wrap(~ Predictor, scales = "free_x") +
  theme_minimal() +
  labs(title = "Yield vs Top GBM Predictors",
       x = "Predictor Value", y = "Yield")
## `geom_smooth()` using formula = 'y ~ x'

The plot, "Yield vs Top GBM Predictors," shows scatter plots with LOESS trend lines for the top 10 GBM predictors against the product yield. Visual inspection reveals that many of the top predictors have nonlinear relationships with yield: several panels show curved or threshold-like patterns, indicating that a predictor's effect is not constant across its range, with yield sometimes rising up to a point and then plateauing or declining. Such patterns, visible for example in BiologicalMaterial03, ManufacturingProcess17, and ManufacturingProcess30, suggest interaction effects or diminishing returns that a simple linear model cannot adequately capture. This intuition reinforces the utility of nonlinear regression models like GBM and SVM for understanding, and potentially optimizing, the chemical manufacturing process.