Introduction

This document presents solutions to a series of exercises from Chapter 7 in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. These exercises focus on non-linear regression techniques, including Neural Networks, Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). The aim is to deepen understanding of how these models handle complex, non-linear relationships in predictive modeling tasks.

We will address the following problems step by step:

  1. Problem 7.2: This exercise involves simulating a nonlinear dataset and applying predictive models such as k-Nearest Neighbors (KNN) and Multivariate Adaptive Regression Splines (MARS). The goal is to evaluate model performance, tune hyperparameters, and analyze the ability of MARS to select informative predictors (those named X1–X5) from the dataset.

  2. Problem 7.5: This exercise explores the application of nonlinear regression models, specifically Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Multivariate Adaptive Regression Splines (MARS). The goal is to compare their performance in predicting a continuous response (Yield) while identifying the most important predictors. The exercise emphasizes the critical role of hyperparameter tuning, data preprocessing, and visualization to interpret model insights and optimize predictive performance.

Each solution is accompanied by R code, results, and discussion.


Problem 7.2

7.2.a. Initial Setup and Data Simulation

Use the mlbench.friedman1() function from the mlbench package to create the training (200 samples) and test datasets (5000 samples). Include non-informative predictors to evaluate model robustness.

# Load necessary libraries
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
# Simulate Friedman1 dataset
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
testData <- mlbench.friedman1(5000, sd = 1)

# Convert matrices to data frames
trainingData$x <- data.frame(trainingData$x)
testData$x <- data.frame(testData$x)

# Check dataset structure
str(trainingData$x)
## 'data.frame':    200 obs. of  10 variables:
##  $ X1 : num  0.534 0.584 0.59 0.691 0.667 ...
##  $ X2 : num  0.648 0.438 0.588 0.226 0.819 ...
##  $ X3 : num  0.8508 0.6727 0.4097 0.0334 0.7168 ...
##  $ X4 : num  0.1816 0.6692 0.3381 0.0669 0.8032 ...
##  $ X5 : num  0.929 0.1638 0.8941 0.6374 0.0831 ...
##  $ X6 : num  0.3618 0.4531 0.0268 0.525 0.2234 ...
##  $ X7 : num  0.827 0.649 0.179 0.513 0.664 ...
##  $ X8 : num  0.421 0.845 0.35 0.797 0.904 ...
##  $ X9 : num  0.5911 0.9282 0.0176 0.6899 0.397 ...
##  $ X10: num  0.589 0.758 0.444 0.445 0.55 ...

In this step, we simulated data using the mlbench.friedman1 function from the mlbench package. This function generates a dataset based on the Friedman1 equation:

\[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \]

where:

  • \(x_1, x_2, \ldots, x_{10}\) are predictors.
  • Only \(x_1\) through \(x_5\) are informative, while \(x_6\) through \(x_{10}\) are non-informative (random noise).
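As a rough illustration of the equation, the sketch below writes its noise-free part as a base R function and checks that the last five columns have no effect on it. The helper name friedman1_mean, the seed, and the simulated matrix are assumptions made for this example; they are not part of mlbench.

```r
# Noise-free part of the Friedman1 equation; x is a matrix with at least 5 columns.
# (friedman1_mean is a hypothetical helper written for this illustration.)
friedman1_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
}

set.seed(1)                                   # arbitrary seed for this illustration
x <- matrix(runif(200 * 10), ncol = 10)       # 10 uniform(0, 1) predictors
y <- friedman1_mean(x) + rnorm(200, sd = 1)   # add N(0, 1) noise

# Replacing columns 6-10 leaves the mean function untouched,
# which is what makes those predictors non-informative.
x_noise_changed <- x
x_noise_changed[, 6:10] <- runif(200 * 5)
same_mean <- identical(friedman1_mean(x), friedman1_mean(x_noise_changed))
print(same_mean)
```

A model that selects predictors well should therefore rely on columns 1 to 5 only.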

Key Observations

  1. Training Dataset:
    • Contains 200 observations with 10 predictors.
    • Predictors \(X_1\) to \(X_{10}\) are uniformly distributed between 0 and 1.
  2. Test Dataset:
    • Contains 5000 observations for reliable estimation of test error.
    • Helps assess the generalization of the models.

Purpose

The primary goal of this simulation is to provide a controlled environment for testing regression models and evaluating their ability to:

  • Identify the relationships between the predictors and the response.
  • Handle the presence of irrelevant predictors (\(X_6\)–\(X_{10}\)).

This setup is critical for exploring the performance of nonlinear regression methods like KNN, MARS, and others in subsequent steps.

7.2.b. Fit and Tune the KNN Model

Fit a KNN model using caret with bootstrap resampling and tune the number of neighbors (k).

# KNN model with preprocessing
set.seed(123)
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10,
  trControl = trainControl(method = "boot", number = 25)
)

# Print KNN model results
print(knnModel)
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.548433  0.4919564  2.888283
##    7  3.425531  0.5255725  2.778090
##    9  3.346026  0.5523023  2.704791
##   11  3.252313  0.5875603  2.620492
##   13  3.232552  0.6000482  2.601113
##   15  3.205067  0.6203296  2.586704
##   17  3.172791  0.6408339  2.566738
##   19  3.183306  0.6494300  2.587220
##   21  3.190873  0.6556293  2.596793
##   23  3.202234  0.6597746  2.604279
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
# Predict on the test set
knnPred <- predict(knnModel, newdata = testData$x)

# Evaluate performance
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

In this step, we trained and tuned a k-Nearest Neighbors (KNN) model on the Friedman1 dataset using the caret package. The following steps summarize the procedure and results:

  1. Preprocessing:
    • The predictors were standardized by centering and scaling to ensure that distance calculations in KNN were not influenced by the scale of the variables.
  2. Model Training and Tuning:
    • The model was trained with bootstrap resampling (25 repetitions) to ensure robust estimation of performance metrics.
    • The number of neighbors (\(k\)) was tuned over the 10 candidate values generated by tuneLength = 10 (odd values from 5 to 23), using RMSE as the selection criterion.
  3. Results:
    • The optimal value of \(k\) was found to be 17, which minimized the RMSE.
    • Key performance metrics on the test dataset:
      • RMSE: 3.2041
      • R-squared: 0.6820
      • MAE: 2.5683
  4. Observations:
    • As \(k\) increased from 5 to 17, the resampled RMSE decreased steadily; beyond \(k = 17\) it rose slightly (3.1728 at \(k = 17\) versus 3.2022 at \(k = 23\)).
    • The R-squared value (\(0.6820\)) indicates that the KNN model explains approximately 68.20% of the variability in the response variable.
    • The model’s performance on the test dataset is consistent with the resampling results, highlighting the reliability of the model.
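For reference, the three metrics reported above can be reproduced by hand. The sketch below is a base R approximation of what postResample() reports, with R-squared taken as the squared correlation between predictions and observations; the function name regression_metrics and the toy vectors are assumptions made for this illustration.

```r
# Hand-rolled versions of the three metrics (regression_metrics is a hypothetical helper)
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,        # squared correlation, not 1 - SSE/SST
    MAE      = mean(abs(pred - obs))
  )
}

# Toy values, made up for this example
obs  <- c(10, 12, 15, 11, 18)
pred <- c(11, 11, 14, 13, 17)

m         <- regression_metrics(pred, obs)
m_shifted <- regression_metrics(pred + 5, obs)   # same predictions with a constant bias
print(rbind(original = m, shifted = m_shifted))
```

Because a squared correlation ignores constant bias, the shifted predictions keep the same R-squared while RMSE and MAE get worse. Reading R-squared as "share of variability explained" is therefore an approximation, and RMSE is the safer metric for comparing models.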

In summary: The KNN model captured part of the nonlinear structure in the Friedman1 data. The optimal number of neighbors (\(k = 17\)) balances model complexity and generalization, giving a test RMSE of 3.2041 and an R-squared of 0.6820.
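To make the mechanics concrete, here is a minimal base R sketch of KNN regression with centering and scaling. It is not caret's implementation; the function knn_regress and the simulated data are assumptions made for this illustration.

```r
# Minimal KNN regression (knn_regress is a hypothetical helper, not a caret function)
knn_regress <- function(train_x, train_y, new_x, k) {
  # Center and scale with statistics from the training set only
  mu <- colMeans(train_x)
  s  <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = s)
  nw <- scale(new_x,   center = mu, scale = s)

  # For each new row: Euclidean distance to every training row,
  # then the mean response of the k closest training rows
  apply(nw, 1, function(row) {
    d <- sqrt(colSums((t(tr) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

set.seed(1)                                        # arbitrary seed
train_x <- matrix(runif(100 * 3), ncol = 3)        # toy predictors
train_y <- 10 * train_x[, 1] + rnorm(100, sd = 0.5)
new_x   <- matrix(runif(5 * 3), ncol = 3)

print(knn_regress(train_x, train_y, new_x, k = 17))
```

With k = 1 the function reproduces the training responses exactly, and with k equal to the training size it returns the overall mean for every point. Those two extremes bracket the complexity trade-off that tuning k explores.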

7.2.c. Fit and Tune the MARS Model

Fit a MARS model using the earth method in caret. Tune hyperparameters, such as the number of terms and degree of interactions.

# MARS model (no preprocessing applied)
set.seed(123)
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneLength = 10,
  trControl = trainControl(method = "boot", number = 25)
)

# Print MARS model results
print(marsModel)
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE     
##    2      4.379381  0.2301740  3.575902
##    3      3.649438  0.4583683  2.944879
##    4      2.769352  0.6876944  2.223704
##    6      2.366383  0.7734368  1.888582
##    7      1.988717  0.8380231  1.581362
##    9      1.827116  0.8637619  1.443208
##   10      1.788268  0.8690065  1.410531
##   12      1.815936  0.8656814  1.422587
##   13      1.824463  0.8644827  1.433229
##   15      1.856755  0.8590033  1.460392
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 1.
# Predict on the test set
marsPred <- predict(marsModel, newdata = testData$x)

# Evaluate performance
postResample(pred = marsPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 1.776575 0.872700 1.358367

In this step, we trained and tuned a Multivariate Adaptive Regression Splines (MARS) model on the Friedman1 dataset using the earth method from the caret package. The procedure and results are summarized below:

  1. Model Training and Tuning:
    • The MARS model was trained with bootstrap resampling (25 repetitions) to ensure robust evaluation of performance metrics.
    • The number of retained terms (nprune) was tuned using RMSE as the criterion; with tuneLength alone, caret held the degree of interactions (degree) constant at 1.
  2. Results:
    • The optimal values were:
      • nprune: 10 (number of terms)
      • degree: 1 (interaction degree)
    • Performance metrics on the test dataset:
      • RMSE: 1.7766
      • R-squared: 0.8727
      • MAE: 1.3584
  3. Observations:
    • The MARS model achieved the best performance among the models tested so far, with the lowest RMSE and highest R-squared.
    • The high R-squared value (\(0.8727\)) indicates that the MARS model explains approximately 87.27% of the variability in the response variable.
    • The relatively low RMSE (\(1.7766\)) and MAE (\(1.3584\)) on the test dataset highlight the model’s predictive accuracy.

Conclusion: The MARS model demonstrated superior performance compared to the KNN model on the Friedman1 data. This is likely due to its ability to select important variables and fit piecewise-linear effects for each of them; because degree was held at 1, the final model is additive and does not include interaction terms.
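Since degree stayed at 1 in the run above, one possible follow-up is to pass an explicit tuning grid so that both degree and nprune are varied. The sketch below is an assumption-laden illustration: the grid ranges and the object names mars_grid and marsTuned are chosen for this example, and the fit is skipped unless caret, earth, and the simulated trainingData object are available. No results are claimed for it.

```r
# Candidate grid: additive (degree 1) and two-way interaction (degree 2) models.
# The ranges below are illustrative assumptions, not values from the original analysis.
mars_grid <- expand.grid(degree = 1:2, nprune = 2:15)
print(nrow(mars_grid))

# Fit only when the packages and the simulated data are available
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("earth", quietly = TRUE) &&
    exists("trainingData")) {
  set.seed(123)
  marsTuned <- caret::train(
    x = trainingData$x,
    y = trainingData$y,
    method = "earth",
    tuneGrid = mars_grid,
    trControl = caret::trainControl(method = "boot", number = 25)
  )
  print(marsTuned$bestTune)
}
```

Allowing degree = 2 would let MARS represent the \(x_1 x_2\) interaction in the Friedman1 equation directly; whether that helps on this sample has to be checked by running the fit.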

7.2.d. Wrap-Up Discussion

In this section, we summarize the results of Problem 7.2 and directly address the question: Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Model Performance Comparison

  1. KNN Model:
    • Performance Metrics (Test Data):
      • RMSE: 3.2041
      • R-squared: 0.6820
      • MAE: 2.5683
    • Observation:
      • While KNN performs reasonably well, the higher RMSE and lower R-squared indicate it struggles to fully capture the non-linear relationships in the data.
  2. MARS Model:
    • Performance Metrics (Test Data):
      • RMSE: 1.7766
      • R-squared: 0.8727
      • MAE: 1.3584
    • Observation:
      • MARS significantly outperforms KNN in all metrics, with a much lower RMSE and higher R-squared. This indicates that MARS is better suited for handling the non-linear dependencies in the simulated dataset.

Conclusion: The MARS model gives the best performance based on its superior test metrics.

Informative Predictors (X1–X5)

  • The dataset was generated using the equation: \[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \] where \(X1\)–\(X5\) are the informative predictors and \(X6\)–\(X10\) are non-informative noise predictors.

  • To evaluate whether MARS selects the informative predictors, we examined the terms selected in the final model:

  summary(marsModel$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=1,
##             nprune=10)
## 
##                coefficients
## (Intercept)       20.395804
## h(0.621722-X1)   -10.925741
## h(0.601063-X2)   -10.668385
## h(X3-0.281766)     3.966649
## h(0.447442-X3)    12.392139
## h(X3-0.636458)     7.640411
## h(0.734892-X4)    -9.900621
## h(X4-0.734892)    10.274706
## h(0.850094-X5)    -5.343409
## h(X6-0.361791)    -1.769825
## 
## Selected 10 of 18 terms, and 6 of 10 predictors (nprune=10)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 2.731203    RSS 447.3848    GRSq 0.889112    RSq 0.9082649
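Each h(·) term in the summary is a hinge function, h(z) = max(0, z). As a sketch, the printed model can be rewritten as an ordinary R function using the coefficients and knots shown above. The names h and mars_hat are assumptions for this illustration, and because the printed values are rounded, the predictions only approximate those of the fitted object.

```r
# Hinge function used by MARS
h <- function(z) pmax(0, z)

# The final model from the summary above, written out term by term
# (mars_hat is a hypothetical helper; coefficients are copied from the printed output)
mars_hat <- function(X1, X2, X3, X4, X5, X6) {
  20.395804 -
    10.925741 * h(0.621722 - X1) -
    10.668385 * h(0.601063 - X2) +
     3.966649 * h(X3 - 0.281766) +
    12.392139 * h(0.447442 - X3) +
     7.640411 * h(X3 - 0.636458) -
     9.900621 * h(0.734892 - X4) +
    10.274706 * h(X4 - 0.734892) -
     5.343409 * h(0.850094 - X5) -
     1.769825 * h(X6 - 0.361791)
}

# Prediction at the centre of the predictor space
print(mars_hat(0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
```

Written this way, the structure is easier to read: the two X4 hinges share a knot and have slopes close to 10 on either side, and the X5 hinge has a slope close to 5 below its knot, which mirrors the \(10 x_4\) and \(5 x_5\) terms of the generating equation. X7 through X10 do not appear at all.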

Final Answer

Which model gives the best performance?

  • The MARS model provides the best performance, with the lowest RMSE (1.7766) and highest R-squared (0.8727).

Does MARS select the informative predictors (X1–X5)?

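  • Yes, for the most part. MARS identifies and uses X1–X5 as the main informative predictors and leaves the noise predictors X7–X10 unused. It does include one noise predictor, X6, but with the smallest coefficient (about -1.77) and the lowest importance rank among the selected predictors, so its effect on the fit is minor.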

Problem 7.5

This exercise uses the data for a chemical manufacturing process, preprocessed as in Exercise 6.3:

# Load the required library and dataset
library(AppliedPredictiveModeling)
library(caret)

# Load dataset
data(ChemicalManufacturingProcess)

# Separate predictors and response
predictors <- ChemicalManufacturingProcess[, -1] # Exclude the first column (Yield)
response <- ChemicalManufacturingProcess$Yield

# Handle missing values using median imputation
preprocess <- preProcess(predictors, method = "medianImpute")
imputed_predictors <- predict(preprocess, predictors)

# Split the data into training (80%) and testing (20%) sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(response, p = 0.8, list = FALSE)

# Create training and testing sets
train_predictors <- imputed_predictors[train_index, ]
test_predictors <- imputed_predictors[-train_index, ]
train_response <- response[train_index]
test_response <- response[-train_index]
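The code above computes the imputation medians on all rows before splitting. A stricter variant would estimate the medians on the training rows only and reuse them for the test rows. The sketch below shows that idea in base R on toy data; the helper median_impute and the toy data frames are assumptions for this illustration and are not what produced the results in this document.

```r
# Median imputation with medians learned from the training data only
# (median_impute is a hypothetical helper written for this example)
median_impute <- function(train, new = train) {
  med <- vapply(train, function(v) as.numeric(median(v, na.rm = TRUE)), numeric(1))
  for (col in names(med)) {
    new[[col]][is.na(new[[col]])] <- med[[col]]
  }
  new
}

# Toy data, made up for this example
toy_train <- data.frame(a = c(1, NA, 3, 10), b = c(2, 4, NA, 6))
toy_test  <- data.frame(a = c(NA, 5),        b = c(NA, NA))

imputed_test <- median_impute(toy_train, toy_test)
print(imputed_test)
```

In caret the same effect can be obtained by calling preProcess() on the training predictors and then predict() on both sets.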

7.5.a Train Nonlinear Regression Models

7.5.a.1 Train MARS Model

# Train a MARS model using the 'earth' method
set.seed(123)
mars_model <- train(
  x = train_predictors,
  y = train_response,
  method = "earth",
  tuneLength = 10,  # Tune over 10 values of hyperparameters
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print MARS model details
mars_model
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  57 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE      
##    2      1.378691  0.4988104  1.1236683
##    3      1.218983  0.6300794  0.9904402
##    5      1.292456  0.5640833  1.0531348
##    7      1.352366  0.5463897  1.1064317
##    8      1.328867  0.5530096  1.0965782
##   10      1.403699  0.5200030  1.1205506
##   12      1.422694  0.4938241  1.1282104
##   13      1.423642  0.4954649  1.1182360
##   15      1.432566  0.5021617  1.1246753
##   17      1.432889  0.5050105  1.1235802
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
# Predict and evaluate on the test set
mars_pred <- predict(mars_model, newdata = test_predictors)
mars_perf <- postResample(pred = mars_pred, obs = test_response)
mars_perf
##      RMSE  Rsquared       MAE 
## 1.3765957 0.4680622 1.1161381

7.5.a.2 Train SVM Model

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
# Train an SVM model using radial kernel
set.seed(123)
svm_model <- train(
  x = train_predictors,
  y = train_response,
  method = "svmRadial",
  tuneLength = 10,  # Tune over 10 values of cost (sigma is estimated and held constant)
  preProcess = c("center", "scale"),  # Preprocessing
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print SVM model details
svm_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.357053  0.5717606  1.1060197
##     0.50  1.250562  0.6057450  1.0163217
##     1.00  1.161474  0.6434975  0.9343628
##     2.00  1.111380  0.6638872  0.8952385
##     4.00  1.099131  0.6618941  0.8755585
##     8.00  1.080413  0.6781477  0.8590074
##    16.00  1.078416  0.6806858  0.8582669
##    32.00  1.078416  0.6806858  0.8582669
##    64.00  1.078416  0.6806858  0.8582669
##   128.00  1.078416  0.6806858  0.8582669
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01452627
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01452627 and C = 16.
# Predict and evaluate on the test set
svm_pred <- predict(svm_model, newdata = test_predictors)
svm_perf <- postResample(pred = svm_pred, obs = test_response)
svm_perf
##      RMSE  Rsquared       MAE 
## 1.2222178 0.5568576 1.0260355
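To give a sense of what the sigma value in the output controls, the sketch below evaluates a radial basis function kernel of the form exp(-sigma * ||x - z||^2), which is the parameterization kernlab uses for its RBF kernel. The function rbf_kernel and the simulated points are assumptions made for this illustration.

```r
# Radial basis function kernel (rbf_kernel is a hypothetical helper)
rbf_kernel <- function(x, z, sigma) exp(-sigma * sum((x - z)^2))

sigma_hat <- 0.01452627            # value reported in the tuning output above

set.seed(1)                        # arbitrary seed
x_a <- rnorm(57)                   # two made-up standardized points in 57 dimensions
x_b <- rnorm(57)

k_self <- rbf_kernel(x_a, x_a, sigma_hat)             # a point compared with itself
k_near <- rbf_kernel(x_a, x_a + 0.1, sigma_hat)       # a small perturbation
k_far  <- rbf_kernel(x_a, x_b, sigma_hat)             # an unrelated point
print(c(self = k_self, near = k_near, far = k_far))
```

Similarity is 1 for identical points and decays with squared distance; a smaller sigma makes the decay slower, which gives smoother fitted functions.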

7.5.a.3 Train KNN Model

# Train a KNN model
set.seed(123)
knn_model <- train(
  x = train_predictors,
  y = train_response,
  method = "knn",
  tuneLength = 10,  # Tune over 10 values of 'k'
  preProcess = c("center", "scale"),  # Preprocessing
  trControl = trainControl(method = "cv", number = 10)  # 10-fold cross-validation
)

# Print KNN model details
knn_model
## k-Nearest Neighbors 
## 
## 144 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    5  1.234126  0.5855017  0.9638509
##    7  1.292584  0.5489661  1.0408474
##    9  1.294423  0.5543285  1.0511925
##   11  1.302680  0.5447302  1.0603173
##   13  1.321252  0.5210424  1.0727730
##   15  1.346144  0.5052424  1.1017134
##   17  1.350765  0.5084828  1.0910419
##   19  1.366269  0.4950894  1.1099770
##   21  1.374749  0.4950063  1.1181499
##   23  1.392496  0.4843170  1.1313867
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Predict and evaluate on the test set
knn_pred <- predict(knn_model, newdata = test_predictors)
knn_perf <- postResample(pred = knn_pred, obs = test_response)
knn_perf
##      RMSE  Rsquared       MAE 
## 1.4063326 0.4224466 1.1606875

7.5.a.4 Compare Model Performance

# Create a summary table of performance metrics
performance_results <- data.frame(
  Model = c("MARS", "SVM", "KNN"),
  RMSE = c(mars_perf["RMSE"], svm_perf["RMSE"], knn_perf["RMSE"]),
  Rsquared = c(mars_perf["Rsquared"], svm_perf["Rsquared"], knn_perf["Rsquared"])
)

# Print the comparison table
performance_results
##   Model     RMSE  Rsquared
## 1  MARS 1.376596 0.4680622
## 2   SVM 1.222218 0.5568576
## 3   KNN 1.406333 0.4224466

7.5.a.5 Which Nonlinear Regression Model Gives the Optimal Resampling and Test Set Performance?

Based on the results, the performance of the three nonlinear regression models—MARS, SVM, and KNN—was evaluated using RMSE and R-squared metrics. The findings are summarized as follows:

  1. Multivariate Adaptive Regression Splines (MARS):
    • RMSE: 1.3766
    • R-squared: 0.4681
    • The MARS model demonstrated moderate predictive capability, ranking second on both metrics: its RMSE is higher than SVM's but lower than KNN's, and its R-squared indicates it captured less of the variance than SVM.
  2. Support Vector Machines (SVM):
    • RMSE: 1.2222
    • R-squared: 0.5569
    • The SVM model achieved the best performance in this comparison. It had the lowest RMSE, indicating higher accuracy in predictions, and the highest R-squared, reflecting a better explanation of the variance in the response variable.
  3. k-Nearest Neighbors (KNN):
    • RMSE: 1.4063
    • R-squared: 0.4224
    • The KNN model showed the lowest predictive performance among the three models, with the highest RMSE and the lowest R-squared. This indicates that the KNN model was not effective for this dataset.

Conclusion: Among the three models, the SVM model provided the best resampling and test set performance. Its cross-validated RMSE (1.0784) was lower than that of MARS (1.2190) and KNN (1.2341), and on the test set it again achieved the lowest RMSE (1.2222) and the highest R-squared (0.5569). Thus, SVM is the recommended nonlinear regression model for this dataset.
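The comparison can be laid out side by side by collecting the best cross-validated RMSE of each model and its test RMSE, both copied from the outputs above. The data frame name comparison and the Gap column are assumptions added for this illustration.

```r
# Best cross-validated RMSE and test RMSE for each model, copied from the printed results
comparison <- data.frame(
  Model     = c("MARS", "SVM", "KNN"),
  CV_RMSE   = c(1.218983, 1.078416, 1.234126),
  Test_RMSE = c(1.3765957, 1.2222178, 1.4063326)
)

# Difference between test and resampling error (hypothetical helper column)
comparison$Gap <- comparison$Test_RMSE - comparison$CV_RMSE

# Rank the models by cross-validated RMSE
print(comparison[order(comparison$CV_RMSE), ])
```

The ranking is the same under both estimates, and every model does somewhat worse on the held-out set than in cross-validation, which is not unusual when the test set is only about 20% of a small dataset.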

7.5.b Identify Most Important Predictors in SVM (optimal nonlinear regression model)

# Variable importance for the SVM model
varImp(svm_model)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     94.06
## ManufacturingProcess36   81.54
## BiologicalMaterial03     81.27
## ManufacturingProcess13   80.63
## ManufacturingProcess31   78.52
## BiologicalMaterial02     76.04
## ManufacturingProcess17   75.92
## ManufacturingProcess09   73.04
## BiologicalMaterial12     69.48
## ManufacturingProcess06   66.24
## BiologicalMaterial11     59.72
## ManufacturingProcess33   57.06
## ManufacturingProcess29   54.40
## BiologicalMaterial04     53.93
## BiologicalMaterial01     45.62
## BiologicalMaterial08     44.93
## ManufacturingProcess30   42.47
## BiologicalMaterial09     40.88
## ManufacturingProcess11   38.38

7.5.b.1 Importance of Predictors in the Optimal Nonlinear Regression Model

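The importance of predictors was evaluated for the optimal nonlinear regression model, the Support Vector Machines (SVM) model with a Radial Basis Function kernel. Because caret has no model-specific importance measure for this model, varImp() falls back to a model-independent filter, which is why the output is labelled "loess r-squared variable importance": each predictor is scored by the R-squared of a loess fit against Yield, and the scores are rescaled so that the top predictor has a value of 100. The results indicate that both biological and process variables contribute significantly. Among the top ten most important predictors, a mix of variables from both categories is observed.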

Biological vs. Process Variables

From the ranked list of the top ten predictors:

  1. The most important predictor is a process variable (ManufacturingProcess32), with a relative importance of 100%.

  2. Biological variables, such as BiologicalMaterial06 and BiologicalMaterial03, rank second and fourth, respectively, with relative importance values of 94.06% and 81.27%.

  3. Process variables, such as ManufacturingProcess36 and ManufacturingProcess13, also rank high with importance values of 81.54% and 80.63%.

Overall, the top ten contain six process variables and four biological variables. ManufacturingProcess32 holds the highest relative importance, but biological variables also rank prominently (second, fourth, seventh, and tenth), highlighting their role in explaining the response variable in this manufacturing process.
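The "loess r-squared" label in the output refers to a filter-style score computed one predictor at a time. The sketch below approximates that idea in base R on simulated data: fit a loess curve of the response on a single predictor, compute a pseudo R-squared, and rescale the scores to a 0 to 100 range. The helper loess_r2, the simulated predictors, and the exact rescaling are assumptions for this illustration and may differ in detail from caret's internals.

```r
# Pseudo R-squared of a loess fit of y on one predictor (loess_r2 is a hypothetical helper)
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(42)                                   # arbitrary seed
n      <- 150
signal <- runif(n)                             # predictor related to the response
noise  <- runif(n)                             # predictor unrelated to the response
y      <- 3 * signal^2 + rnorm(n, sd = 0.3)    # made-up nonlinear relationship

raw    <- c(signal = loess_r2(signal, y), noise = loess_r2(noise, y))
scaled <- 100 * (raw - min(raw)) / (max(raw) - min(raw))   # rescale to 0-100
print(round(rbind(raw = raw, scaled = scaled), 3))
```

Because each predictor is scored on its own, this ranking reflects marginal relationships with the response and does not depend on the fitted SVM.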

7.5.b.2 Conclusion

In the optimal nonlinear regression model (SVM), both biological and process variables matter. Process variables slightly dominate the top positions (six of the top ten, including the first), but biological variables are also key contributors, indicating that successful modeling of the response depends on information from both types of predictors.

7.5.c Visualize and Interpret Relationships

suppressMessages({
suppressWarnings({
# Visualize relationships using ggplot2 for the top 10 most important predictors from the SVM model
library(ggplot2)
library(gridExtra)

# Extract top 10 SVM-selected predictors
selected_predictors <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36", 
                         "BiologicalMaterial03", "ManufacturingProcess13", "ManufacturingProcess31", 
                         "BiologicalMaterial02", "ManufacturingProcess17", "ManufacturingProcess09", 
                         "BiologicalMaterial12")

# Create a list to store plots
plot_list <- list()

# Loop through each selected predictor to create individual scatter plots
for (predictor in selected_predictors) {
  plot_list[[predictor]] <- ggplot(data = ChemicalManufacturingProcess, aes_string(x = predictor, y = "Yield")) +
    geom_point(color = "darkorange", alpha = 0.6) +
    geom_smooth(method = "lm", color = "blue", se = FALSE) +
    ggtitle(paste("Relationship between", predictor, "and Yield")) +
    theme_minimal() +
    labs(x = predictor, y = "Yield") +
    theme(plot.title = element_text(size = 5, face = 'bold'),
          axis.title = element_text(size = 7),
          axis.text = element_text(size = 5)) # Adjust the size as needed

}

# Combine all plots into a grid
grid.arrange(grobs = plot_list, ncol = 3)
})
})

Analysis of Relationships Between Predictors and Yield:

The visualizations of the top 10 predictors from the SVM model (optimal nonlinear regression model) against the response variable (Yield) reveal distinct patterns and relationships.

  1. Process Predictors:
    • Predictors such as ManufacturingProcess32, ManufacturingProcess36, and ManufacturingProcess13 exhibit linear or slightly nonlinear trends with Yield.
    • For example, ManufacturingProcess32 shows a positive trend with Yield, suggesting that higher values in this process parameter might correspond to increased yield.
    • ManufacturingProcess13 and ManufacturingProcess31, on the other hand, display negative trends, indicating a possible inverse relationship between these process parameters and yield.
  2. Biological Predictors:
    • Predictors such as BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial02 show varying degrees of correlation with Yield.
    • BiologicalMaterial06 demonstrates a positive trend, suggesting that its increased presence positively impacts the yield.
    • The trend for BiologicalMaterial03 is positive but less pronounced, indicating a moderate association with yield.
  3. Intuition and Interpretation:
    • The biological predictors generally show positive relationships with yield, which aligns with the intuition that higher-quality or higher concentrations of biological materials might enhance manufacturing efficiency.
    • The process predictors, however, show mixed relationships. While some, like ManufacturingProcess32, have positive effects, others, like ManufacturingProcess13, exhibit negative trends. This suggests that the optimization of process parameters might involve trade-offs or thresholds where the process becomes less efficient.
  4. Key Observations:
    • The importance ranking emphasizes the significance of both biological and process variables. However, the visualizations show that biological predictors are more consistently positively associated with yield than process predictors, which show both positive and negative associations.
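The directions read off the plots can be cross-checked numerically with pairwise correlations. The sketch below defines a small helper and demonstrates it on toy data; the helper name trend_table and the toy data frame are assumptions for this illustration. The last lines apply it to the chemical manufacturing data only if the AppliedPredictiveModeling package is installed, and no output is claimed for that call.

```r
# Sign and size of the linear association between each predictor and the response
# (trend_table is a hypothetical helper written for this example)
trend_table <- function(data, predictors, response) {
  r <- vapply(
    predictors,
    function(p) cor(data[[p]], data[[response]], use = "complete.obs"),
    numeric(1)
  )
  data.frame(
    predictor   = predictors,
    correlation = r,
    direction   = ifelse(r > 0, "positive", "negative"),
    row.names   = NULL
  )
}

# Toy data, made up for this example (one missing value to exercise use = "complete.obs")
toy <- data.frame(
  up    = c(1, 2, 3, 4, NA, 6),
  down  = c(6, 5, 4, 3, 2, 1),
  Yield = c(1.1, 2.0, 2.9, 4.2, 5.0, 6.1)
)
toy_trends <- trend_table(toy, c("up", "down"), "Yield")
print(toy_trends)

# Optional: apply to the real data when the package is available
if (requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  data(ChemicalManufacturingProcess, package = "AppliedPredictiveModeling")
  top3 <- c("ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36")
  print(trend_table(ChemicalManufacturingProcess, top3, "Yield"))
}
```

Correlations summarize linear association only, so they complement the scatter plots but do not replace them.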