Imports

library(gridExtra)
library(ggplot2)
library(cowplot)
options(scipen=10000)
library(mlbench)
library(tidyverse)
library(corrplot)
library(AppliedPredictiveModeling)
library(caret)
library(DataExplorer)
library(kableExtra)
library(mice)
library(pls)

7.2 Simulated Data

Friedman (1991) introduced several benchmark data sets created by simulation. One of these uses the nonlinear equation y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + N(0, σ²), where the x values are random variables uniformly distributed on [0, 1]. The simulation also creates 5 other, non-informative variables. The package mlbench contains a function called mlbench.friedman1 that simulates these data:

set.seed(200)

trainingData = mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x = data.frame(trainingData$x)

testData = mlbench.friedman1(5000, sd = 1)
testData$x = data.frame(testData$x)
## Look at the data using faceted scatter plots of y against each predictor
simDataset <- trainingData$x
simDataset$y <- trainingData$y
simData <- simDataset |> pivot_longer(cols = -y,names_to = "Variable",values_to = "Value")

ggplot( simData, aes(x = Value, y = y)) +

  geom_point() +
  facet_wrap(~Variable, scales = "free_x",ncol = 3) +
  theme_minimal()

correlations1 <- cor(simDataset)
corrplot::corrplot(correlations1, order = "hclust")

The correlation matrix indicates that X1 and X4 are the strongest predictors of the target variable y, with positive correlations of 0.509 and 0.517, respectively. X2 (0.474) and X5 (0.356) also show moderate positive correlations with y, suggesting they are secondary but relevant predictors. The remaining variables (X3, X6, X7, X8, X9, and X10) exhibit very weak correlations with y, indicating minimal direct linear influence on the target.

Most predictor variables have low inter-correlations (|r| < 0.2), suggesting low multicollinearity and unique contributions from each variable. An exception is a modest negative correlation between X6 and X8 (-0.187), but it does not suggest redundancy.
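One caveat when reading these correlations: Pearson correlation only captures linear association. In the Friedman 1 generating equation, X3 enters only through the quadratic term 20 (x3 - 0.5)^2, which is symmetric about 0.5, so its linear correlation with y is expected to be close to zero even though X3 is informative. The base R sketch below re-creates the equation by hand on a larger sample to illustrate this. The function name simulate_friedman1 is made up for illustration and is not the mlbench implementation.

```r
# Hand-rolled version of the Friedman 1 benchmark, for illustration only.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5] +
    rnorm(n, sd = sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(200)
sim <- simulate_friedman1(5000)

# Linear (Pearson) correlation of each predictor with y
linear_cor <- sapply(sim$x, cor, y = sim$y)

# R-squared of a one-predictor quadratic fit, which can pick up the X3 effect
quad_r2 <- sapply(sim$x, function(x) summary(lm(sim$y ~ poly(x, 2)))$r.squared)

round(rbind(linear_cor, quad_r2), 3)
```

If the quadratic R² for X3 is clearly above its squared linear correlation, the weak correlation seen in the plot reflects the shape of the relationship rather than a lack of signal.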

1. Tune several models on these data.

KNN

knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn", 
                  preProc = c("center", "scale"), 
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

MARS

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)

marsTuned <- train(trainingData$x, trainingData$y, method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.327937  0.2544880  3.600474
##   1        3      3.572450  0.4912720  2.895811
##   1        4      2.596841  0.7183600  2.106341
##   1        5      2.370161  0.7659777  1.918669
##   1        6      2.276141  0.7881481  1.810001
##   1        7      1.766728  0.8751831  1.390215
##   1        8      1.780946  0.8723243  1.401345
##   1        9      1.665091  0.8819775  1.325515
##   1       10      1.663804  0.8821283  1.327657
##   1       11      1.657738  0.8822967  1.331730
##   1       12      1.653784  0.8827903  1.331504
##   1       13      1.648496  0.8823663  1.316407
##   1       14      1.639073  0.8841742  1.312833
##   1       15      1.639073  0.8841742  1.312833
##   1       16      1.639073  0.8841742  1.312833
##   1       17      1.639073  0.8841742  1.312833
##   1       18      1.639073  0.8841742  1.312833
##   1       19      1.639073  0.8841742  1.312833
##   1       20      1.639073  0.8841742  1.312833
##   1       21      1.639073  0.8841742  1.312833
##   1       22      1.639073  0.8841742  1.312833
##   1       23      1.639073  0.8841742  1.312833
##   1       24      1.639073  0.8841742  1.312833
##   1       25      1.639073  0.8841742  1.312833
##   1       26      1.639073  0.8841742  1.312833
##   1       27      1.639073  0.8841742  1.312833
##   1       28      1.639073  0.8841742  1.312833
##   1       29      1.639073  0.8841742  1.312833
##   1       30      1.639073  0.8841742  1.312833
##   1       31      1.639073  0.8841742  1.312833
##   1       32      1.639073  0.8841742  1.312833
##   1       33      1.639073  0.8841742  1.312833
##   1       34      1.639073  0.8841742  1.312833
##   1       35      1.639073  0.8841742  1.312833
##   1       36      1.639073  0.8841742  1.312833
##   1       37      1.639073  0.8841742  1.312833
##   1       38      1.639073  0.8841742  1.312833
##   2        2      4.327937  0.2544880  3.600474
##   2        3      3.572450  0.4912720  2.895811
##   2        4      2.661826  0.7070510  2.173471
##   2        5      2.404015  0.7578971  1.975387
##   2        6      2.243927  0.7914805  1.783072
##   2        7      1.856336  0.8605482  1.435682
##   2        8      1.754607  0.8763186  1.396841
##   2        9      1.603578  0.8938666  1.261361
##   2       10      1.492421  0.9084998  1.168700
##   2       11      1.317350  0.9292504  1.033926
##   2       12      1.304327  0.9320133  1.019108
##   2       13      1.277510  0.9323681  1.002927
##   2       14      1.269626  0.9350024  1.003346
##   2       15      1.266217  0.9359400  1.013893
##   2       16      1.268470  0.9354868  1.011414
##   2       17      1.268470  0.9354868  1.011414
##   2       18      1.268470  0.9354868  1.011414
##   2       19      1.268470  0.9354868  1.011414
##   2       20      1.268470  0.9354868  1.011414
##   2       21      1.268470  0.9354868  1.011414
##   2       22      1.268470  0.9354868  1.011414
##   2       23      1.268470  0.9354868  1.011414
##   2       24      1.268470  0.9354868  1.011414
##   2       25      1.268470  0.9354868  1.011414
##   2       26      1.268470  0.9354868  1.011414
##   2       27      1.268470  0.9354868  1.011414
##   2       28      1.268470  0.9354868  1.011414
##   2       29      1.268470  0.9354868  1.011414
##   2       30      1.268470  0.9354868  1.011414
##   2       31      1.268470  0.9354868  1.011414
##   2       32      1.268470  0.9354868  1.011414
##   2       33      1.268470  0.9354868  1.011414
##   2       34      1.268470  0.9354868  1.011414
##   2       35      1.268470  0.9354868  1.011414
##   2       36      1.268470  0.9354868  1.011414
##   2       37      1.268470  0.9354868  1.011414
##   2       38      1.268470  0.9354868  1.011414
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
marsTunePred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsTunePred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

SVM

svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   tuneLength = 15,
                   trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.490737  0.8009120  1.982118
##      0.50  2.246868  0.8153042  1.774454
##      1.00  2.051872  0.8400992  1.614368
##      2.00  1.949707  0.8534618  1.524201
##      4.00  1.886125  0.8610205  1.465373
##      8.00  1.849240  0.8654699  1.436630
##     16.00  1.834604  0.8673639  1.429807
##     32.00  1.833221  0.8675754  1.428687
##     64.00  1.833221  0.8675754  1.428687
##    128.00  1.833221  0.8675754  1.428687
##    256.00  1.833221  0.8675754  1.428687
##    512.00  1.833221  0.8675754  1.428687
##   1024.00  1.833221  0.8675754  1.428687
##   2048.00  1.833221  0.8675754  1.428687
##   4096.00  1.833221  0.8675754  1.428687
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06315483
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06315483 and C = 32.
svmPred <- predict(svmRTuned, newdata = testData$x)
postResample(svmPred, testData$y)
##      RMSE  Rsquared       MAE 
## 2.0741473 0.8255848 1.5755185

Neural Network

nnetAvg2 <- avNNet(trainingData$x, trainingData$y,
                  size = 5,
                  decay = 0.01,
                  repeats = 5,
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetAvg2
## Model Averaged Neural Network with 5 Repeats  
## 
## a 10-5-1 network with 61 weights
## options were - linear output units  decay=0.01
nnetPred <- predict(nnetAvg2, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.5750134 0.9008192 1.2084373

2. Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

To evaluate the best model among K-Nearest Neighbors (KNN), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM), and Neural Networks, we consider three metrics: Root Mean Squared Error (RMSE), R-squared (R²), and Mean Absolute Error (MAE). Lower RMSE and MAE values indicate greater accuracy, while a higher R² demonstrates a better fit to the data.
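As a minimal sketch of what these three metrics compute, the base R helper below reproduces them for a pair of toy vectors. The name regression_metrics is hypothetical. The R² here is the squared correlation between predictions and observations, which is how caret's postResample reports it.

```r
# Hypothetical helper mirroring the three metrics discussed above.
regression_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),   # penalizes large errors more heavily
    Rsquared = cor(pred, obs)^2,         # squared correlation of pred and obs
    MAE = mean(abs(pred - obs)))         # average absolute error
}

# Toy data, chosen only to make the arithmetic easy to follow
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)
regression_metrics(pred, obs)
```

Because this R² is a squared correlation, it rewards predictions that track the ordering of the observations even when they are systematically offset, which is one reason to read it alongside RMSE and MAE rather than on its own.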

MARS clearly outperforms the other models across all metrics. With a test-set RMSE of 1.159 and an R² of 0.946, MARS has the lowest error and explains 94.6% of the variance, indicating strong predictive accuracy. Its MAE of 0.925 further reflects its precision. The neural network and SVM show moderate accuracy, with the averaged neural network achieving an RMSE of 1.575 and an R² of 0.901, ahead of SVM’s RMSE of 2.074 and R² of 0.826. Both models, however, exhibit higher average errors than MARS.

KNN, with an RMSE of 3.204 and an R² of 0.6819, performs the worst, making it the least suitable for this dataset. Given its low error rates and high explanatory power, MARS stands out as the optimal model for accurate predictions on this dataset, with Neural Networks and SVM offering reasonable alternatives.
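Whether MARS selects the informative predictors can be checked by listing the predictors that survive pruning. The sketch below is one way to do that, assuming the mlbench and earth packages are installed. It refits a degree-2 MARS model directly with earth() rather than reusing marsTuned, so the selected terms may differ somewhat from the tuned caret model. With the fitted caret object in memory, varImp(marsTuned) should give a comparable ranking.

```r
library(mlbench)
library(earth)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
x <- data.frame(sim$x)   # columns are named X1 ... X10

# Degree-2 MARS fit; earth runs its own backward pruning pass
mars_fit <- earth(x, sim$y, degree = 2)

# evimp() ranks the predictors that appear in the pruning subsets;
# drop any row flagged "-unused" so only predictors in the final model remain
importance <- evimp(mars_fit)
used <- grep("-unused$", rownames(importance), invert = TRUE, value = TRUE)

# Informative predictors kept, and noise predictors kept (ideally none)
intersect(paste0("X", 1:5), used)
intersect(paste0("X", 6:10), used)
```

If the second call returns an empty vector, the pruned model uses none of the non-informative predictors X6 to X10.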

7.5 Chemical Manufacturing Process

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

data("ChemicalManufacturingProcess")
cherNearZero <-  nearZeroVar(ChemicalManufacturingProcess)

chermical <- ChemicalManufacturingProcess[,-cherNearZero]

Missing values are imputed with the mice() function from the mice package, which performs multiple imputation on the chermical dataset. The method is predictive mean matching (method = 'pmm'): each missing entry is filled with an observed value borrowed from a case with a similar predicted mean, so imputed values are plausible and lie within the range of the observed data. Only the first of the m = 5 completed datasets is used below.

set.seed(2425)  # For reproducibility


chermical_imp <- mice(chermical, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)
# chermical_imp <- mice(ChemicalManufacturingProcess, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)

chermical_comp <-  complete(chermical_imp,1)
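The matching idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified, single-variable illustration and not the mice implementation: mice also draws the regression coefficients at random before matching, which this sketch skips. The helper name pmm_impute and the toy data are made up for illustration.

```r
# Simplified single-variable predictive mean matching (illustration only).
pmm_impute <- function(y, x, donors = 5) {
  miss <- is.na(y)
  obs_idx <- which(!miss)
  donors <- min(donors, length(obs_idx))

  # Regression fitted on the observed rows, then predicted for every row
  fit <- lm(y ~ x, subset = !miss)
  pred <- predict(fit, newdata = data.frame(x = x))

  for (i in which(miss)) {
    # Observed rows whose predicted means are closest to that of row i
    nearest <- obs_idx[order(abs(pred[obs_idx] - pred[i]))[seq_len(donors)]]
    # Copy the observed value of one randomly chosen donor
    y[i] <- y[nearest[sample.int(length(nearest), 1)]]
  }
  y
}

set.seed(1)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.1)
y_missing <- y
y_missing[c(3, 17, 42)] <- NA

y_imputed <- pmm_impute(y_missing, x)
y_imputed[c(3, 17, 42)]
```

Every imputed value is copied from an observed case, which is why this method cannot produce values outside the observed range.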

Creating Training and Test Data

cherTrainIndex <- sample(1:nrow(chermical_comp), size = 0.8 * nrow(chermical_comp))

chermicalTrain <- chermical_comp[cherTrainIndex,] 
chermicalTest <- chermical_comp[-cherTrainIndex,]

KNN Model

set.seed(65465)

knnTune <- train(chermicalTrain[,-1], chermicalTrain$Yield, method = "knn", preProc =c("center","scale","spatialSign"), tuneGrid = data.frame(.k = 1:30), trControl = trainControl(method = "cv"))
knnTune
## k-Nearest Neighbors 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), spatial sign transformation (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 124, 127, 127, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    1  1.459767  0.4913098  1.1128985
##    2  1.282480  0.5440559  1.0454915
##    3  1.193748  0.5935934  0.9542917
##    4  1.240001  0.5664676  0.9688212
##    5  1.256189  0.5604485  1.0068715
##    6  1.241136  0.5630840  0.9924724
##    7  1.267863  0.5479344  1.0189243
##    8  1.263960  0.5399983  0.9940814
##    9  1.265933  0.5336213  1.0028765
##   10  1.275894  0.5388417  1.0069294
##   11  1.290873  0.5200642  1.0286925
##   12  1.312301  0.5041005  1.0459005
##   13  1.322983  0.4921448  1.0530545
##   14  1.320720  0.4867416  1.0505258
##   15  1.325986  0.4853188  1.0570550
##   16  1.335565  0.4749236  1.0625339
##   17  1.334382  0.4727245  1.0577035
##   18  1.334281  0.4779326  1.0601517
##   19  1.341120  0.4737133  1.0684430
##   20  1.357832  0.4617423  1.0835761
##   21  1.359038  0.4600221  1.0939476
##   22  1.368460  0.4525419  1.1026484
##   23  1.381758  0.4413969  1.1157799
##   24  1.396067  0.4306463  1.1299260
##   25  1.396761  0.4334916  1.1295980
##   26  1.404618  0.4279489  1.1386112
##   27  1.403595  0.4281688  1.1369182
##   28  1.411577  0.4219093  1.1482942
##   29  1.417295  0.4201424  1.1466305
##   30  1.426326  0.4094883  1.1570012
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
knnPred <- predict(knnTune,newdata = chermicalTest[,-1])
postResample(pred = knnPred, obs = chermicalTest$Yield)
##      RMSE  Rsquared       MAE 
## 1.3886512 0.4633135 1.1652778

SVM

set.seed(65465)

svmTune <- train(chermicalTrain[,-1], chermicalTrain$Yield, method = "svmRadial", preProc = c("center","scale"), tuneLength = 20, trControl = trainControl(method = "cv"))
svmTune
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 124, 127, 127, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   C          RMSE      Rsquared   MAE      
##        0.25  1.367151  0.5134422  1.1177431
##        0.50  1.258553  0.5609218  1.0120002
##        1.00  1.196853  0.5974907  0.9508429
##        2.00  1.178010  0.6078867  0.9339698
##        4.00  1.186016  0.6022658  0.9327415
##        8.00  1.178600  0.6075962  0.9359434
##       16.00  1.170438  0.6138352  0.9317070
##       32.00  1.170438  0.6138352  0.9317070
##       64.00  1.170438  0.6138352  0.9317070
##      128.00  1.170438  0.6138352  0.9317070
##      256.00  1.170438  0.6138352  0.9317070
##      512.00  1.170438  0.6138352  0.9317070
##     1024.00  1.170438  0.6138352  0.9317070
##     2048.00  1.170438  0.6138352  0.9317070
##     4096.00  1.170438  0.6138352  0.9317070
##     8192.00  1.170438  0.6138352  0.9317070
##    16384.00  1.170438  0.6138352  0.9317070
##    32768.00  1.170438  0.6138352  0.9317070
##    65536.00  1.170438  0.6138352  0.9317070
##   131072.00  1.170438  0.6138352  0.9317070
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01475522
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01475522 and C = 16.
svmPred <- predict(svmTune,newdata = chermicalTest[,-1])
postResample(pred = svmPred, obs = chermicalTest$Yield)
##      RMSE  Rsquared       MAE 
## 1.1428579 0.6467647 0.9384073

Neural Network

highAutoC <- findCorrelation(cor(chermicalTrain), cutoff = .75)

chermical_train_nauto <- chermical_comp[,-highAutoC]
chermical_train_cl <- chermical_train_nauto[cherTrainIndex,] 
chermical_test_cl <- chermical_train_nauto[-cherTrainIndex,] 
set.seed(65465)


nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),.size = c(1:10),.bag = FALSE)

nnetTune <- train(chermical_train_cl[,-1], chermical_train_cl$Yield, method = "avNNet", preProc = c("center","scale","spatialSign"), trControl = trainControl(method = "cv"),
                 tuneGrid = nnetGrid,linout=TRUE,trace=FALSE,MaxNWts = 10 * (ncol(chermicalTrain) + 1) + 10 + 1,maxit=500

                 )
nnetTune
## Model Averaged Neural Network 
## 
## 140 samples
##  38 predictor
## 
## Pre-processing: centered (38), scaled (38), spatial sign transformation (38) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 124, 127, 127, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.409190  0.4353177  1.130543
##   0.00    2    3.508845  0.2430419  2.393885
##   0.00    3    2.663527  0.3698206  2.080390
##   0.00    4    2.088412  0.2695218  1.568408
##   0.00    5    1.509724  0.3840878  1.234838
##   0.00    6    1.590429  0.4091841  1.281557
##   0.00    7    1.990053  0.3442196  1.555718
##   0.00    8    1.764573  0.3952079  1.436233
##   0.00    9    2.694243  0.2402998  1.941267
##   0.00   10    2.847788  0.3104934  1.913652
##   0.01    1    1.451373  0.4367762  1.140688
##   0.01    2    1.529910  0.4305922  1.165592
##   0.01    3    1.599630  0.4214768  1.235717
##   0.01    4    1.562862  0.4123248  1.233834
##   0.01    5    1.403478  0.4696823  1.101737
##   0.01    6    1.375578  0.4909363  1.068646
##   0.01    7    1.309946  0.5226668  1.021425
##   0.01    8    1.312935  0.5203935  1.029290
##   0.01    9    1.390897  0.4924391  1.070437
##   0.01   10    1.432342  0.4688581  1.139593
##   0.10    1    1.335493  0.5094081  1.072773
##   0.10    2    1.265746  0.5539988  0.974298
##   0.10    3    1.276466  0.5769042  1.060896
##   0.10    4    1.386915  0.4914073  1.113241
##   0.10    5    1.320208  0.5308097  1.042065
##   0.10    6    1.388685  0.4916542  1.103497
##   0.10    7    1.323406  0.5237965  1.064777
##   0.10    8    1.332773  0.5223034  1.059356
##   0.10    9    1.270659  0.5461069  1.019994
##   0.10   10    1.322352  0.5179641  1.040021
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 2, decay = 0.1 and bag = FALSE.
nnetPred <- predict(nnetTune,newdata = chermical_test_cl[,-1])
postResample(pred = nnetPred, obs = chermical_test_cl$Yield)
##      RMSE  Rsquared       MAE 
## 1.2594481 0.5611685 1.0214026

MARS

set.seed(65465)


marsTune <- train(chermicalTrain[,-1], chermicalTrain$Yield, method = "earth", preProc = c("center","scale","spatialSign"), trControl = trainControl(method = "cv"), tuneGrid = expand.grid(.degree = 1:2, .nprune = 2:38))
marsTune
## Multivariate Adaptive Regression Spline 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56), spatial sign transformation (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 124, 127, 127, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.377526  0.4476714  1.1058767
##   1        3      1.198196  0.5879336  0.9334591
##   1        4      1.247017  0.5749239  0.9999451
##   1        5      1.262821  0.5407779  0.9903407
##   1        6      1.263545  0.5432153  0.9832537
##   1        7      1.278015  0.5417628  1.0098536
##   1        8      1.283297  0.5366740  1.0169510
##   1        9      1.280583  0.5427513  1.0130993
##   1       10      1.234627  0.5763298  0.9582423
##   1       11      1.252894  0.5602808  0.9734208
##   1       12      1.244299  0.5712988  0.9592809
##   1       13      1.203846  0.6029083  0.9314943
##   1       14      1.202876  0.6009887  0.9270194
##   1       15      1.190118  0.6080222  0.9093196
##   1       16      1.193839  0.6070685  0.9170193
##   1       17      1.195253  0.6075284  0.9188130
##   1       18      1.195253  0.6075284  0.9188130
##   1       19      1.195253  0.6075284  0.9188130
##   1       20      1.195253  0.6075284  0.9188130
##   1       21      1.195253  0.6075284  0.9188130
##   1       22      1.195253  0.6075284  0.9188130
##   1       23      1.195253  0.6075284  0.9188130
##   1       24      1.195253  0.6075284  0.9188130
##   1       25      1.195253  0.6075284  0.9188130
##   1       26      1.195253  0.6075284  0.9188130
##   1       27      1.195253  0.6075284  0.9188130
##   1       28      1.195253  0.6075284  0.9188130
##   1       29      1.195253  0.6075284  0.9188130
##   1       30      1.195253  0.6075284  0.9188130
##   1       31      1.195253  0.6075284  0.9188130
##   1       32      1.195253  0.6075284  0.9188130
##   1       33      1.195253  0.6075284  0.9188130
##   1       34      1.195253  0.6075284  0.9188130
##   1       35      1.195253  0.6075284  0.9188130
##   1       36      1.195253  0.6075284  0.9188130
##   1       37      1.195253  0.6075284  0.9188130
##   1       38      1.195253  0.6075284  0.9188130
##   2        2      1.417372  0.4221761  1.1403206
##   2        3      1.245186  0.5599901  0.9763990
##   2        4      1.281394  0.5296799  1.0368221
##   2        5      1.281787  0.5354372  1.0358778
##   2        6      1.246398  0.5596954  1.0041570
##   2        7      1.307282  0.5154456  1.0483305
##   2        8      1.315359  0.5057966  1.0764262
##   2        9      1.288481  0.5331924  1.0733248
##   2       10      1.256300  0.5521020  1.0278193
##   2       11      1.259719  0.5628655  1.0297708
##   2       12      1.233580  0.5734883  1.0102186
##   2       13      1.225272  0.5833964  1.0037035
##   2       14      1.245066  0.5791751  1.0396496
##   2       15      1.237500  0.5852586  1.0252516
##   2       16      1.233606  0.5873963  1.0195679
##   2       17      1.231766  0.5886096  1.0153754
##   2       18      1.229743  0.5898874  0.9973734
##   2       19      1.216701  0.5932664  0.9904943
##   2       20      1.217163  0.5926963  0.9923476
##   2       21      1.226764  0.5884678  0.9983423
##   2       22      1.223726  0.5893837  0.9946602
##   2       23      1.221494  0.5910793  0.9935113
##   2       24      1.226785  0.5873479  1.0065927
##   2       25      1.227607  0.5871840  1.0094157
##   2       26      1.227607  0.5871840  1.0094157
##   2       27      1.227607  0.5871840  1.0094157
##   2       28      1.227607  0.5871840  1.0094157
##   2       29      1.227607  0.5871840  1.0094157
##   2       30      1.227607  0.5871840  1.0094157
##   2       31      1.227607  0.5871840  1.0094157
##   2       32      1.227607  0.5871840  1.0094157
##   2       33      1.227607  0.5871840  1.0094157
##   2       34      1.227607  0.5871840  1.0094157
##   2       35      1.227607  0.5871840  1.0094157
##   2       36      1.227607  0.5871840  1.0094157
##   2       37      1.227607  0.5871840  1.0094157
##   2       38      1.227607  0.5871840  1.0094157
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 1.
marsPred <- predict(marsTune,newdata = chermicalTest[,-1])
postResample(pred = marsPred, obs = chermicalTest$Yield)
##      RMSE  Rsquared       MAE 
## 1.3788641 0.5485485 1.1152526

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

The Support Vector Machine (SVM) gives the best performance on both resampling and the test set. In 10-fold cross-validation it achieves the lowest RMSE (1.170) and the highest R² (0.614), ahead of MARS (RMSE 1.190), KNN (1.194), and the neural network (1.266). On the test set it again achieves the lowest RMSE (1.143) and MAE (0.938) and the highest R² (0.647), meaning it has the smallest average prediction errors and explains the most variance in yield. This consistent advantage across all metrics suggests that SVM provides the most accurate and reliable predictions for the dataset.

The neural network (test RMSE 1.259, R² 0.561) and MARS (test RMSE 1.379, R² 0.549) perform moderately well, but their higher errors and lower R² make them less suitable than SVM. KNN has the highest test RMSE (1.389) and the lowest test R² (0.463), making it the least effective model on held-out data. Overall, SVM stands out as the optimal choice for predictive accuracy and model fit.
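To make the resampling versus test-set comparison easier to scan, the sketch below collects the RMSE values printed in the model output above into one table. The numbers are copied by hand from that output, and the column names are arbitrary.

```r
# Resampled (10-fold CV) and test-set RMSE copied from the model output above
results <- data.frame(
  model     = c("KNN", "SVM", "avNNet", "MARS"),
  cv_rmse   = c(1.193748, 1.170438, 1.265746, 1.190118),
  test_rmse = c(1.3886512, 1.1428579, 1.2594481, 1.3788641)
)

# Gap between test and resampled error; a large positive gap suggests the
# cross-validation estimate was optimistic for that model
results$gap <- results$test_rmse - results$cv_rmse

# Models ordered from best to worst test-set RMSE
results[order(results$test_rmse), ]
```

With the fitted train objects still in memory, caret's resamples() function could be used instead to compare the fold-level results directly.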

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The Overall score in the varImp() output from the caret package represents the relative importance of each variable in the model. This score quantifies how much a variable contributes to the model’s predictive performance.

predictorImportance <- varImp(svmTune)$importance
predictorImportance$Name <- rownames(predictorImportance)

predictorImportance <- predictorImportance[order(-predictorImportance$Overall),]
rownames(predictorImportance) <- NULL

predictorImportance |> head(10) |> kable() |> kable_styling() |>  kable_classic(full_width = F)
Overall Name
100.00000 ManufacturingProcess32
90.00513 BiologicalMaterial06
73.45538 BiologicalMaterial03
71.63961 ManufacturingProcess13
68.82344 ManufacturingProcess36
67.71626 BiologicalMaterial02
62.52098 ManufacturingProcess31
61.19694 ManufacturingProcess17
58.10035 ManufacturingProcess09
57.34811 BiologicalMaterial12
plot(varImp(svmTune), top=10)
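A note on where these scores probably come from: radial SVMs have no built-in importance measure, and caret's documentation describes a model-free filter approach for such cases, in which each predictor is scored on its own against the response and the scores are rescaled to 0 to 100. The base R sketch below illustrates that idea with a loess fit per predictor on the built-in mtcars data. It is an approximation of the concept under those assumptions, not caret's code, and the helper name loess_r2 is hypothetical.

```r
# R-squared of a one-predictor loess fit, relative to an intercept-only model
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

# Score a few predictors of mpg individually (mtcars ships with base R)
scores <- sapply(mtcars[, c("wt", "hp", "disp")], loess_r2, y = mtcars$mpg)

# Rescale so the weakest predictor scores 0 and the strongest scores 100
scaled <- 100 * (scores - min(scores)) / (max(scores) - min(scores))
round(sort(scaled, decreasing = TRUE), 2)
```

Under this reading, the ranking reflects each predictor's individual relationship with yield rather than its contribution inside the fitted SVM, so correlated predictors can all rank highly together.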

Score of contribution from each Variable Class:

bio <- predictorImportance |> filter(grepl("Biological",Name)) |> summarise(Total_Score=sum(Overall))
manu <- predictorImportance |> filter(grepl("Manufacturing",Name)) |> summarise(Total_Score=sum(Overall))
bio$VariableType <- "Biological"
manu$VariableType <- "Manufacturing"

bind_rows(
  
  bio,
  manu
) |> kable() |> kable_styling() |>  kable_classic(full_width = F)
Total_Score VariableType
530.6131 Biological
974.4208 Manufacturing

In the optimal nonlinear regression model, ManufacturingProcess32 ranks as the most important predictor, with an overall score of 100. This suggests that it is the strongest contributor to the model’s predictive power. Following this, BiologicalMaterial06 and BiologicalMaterial03 also show high importance scores (90.01 and 73.46, respectively), indicating that key biological materials play a significant role in predicting yield. ManufacturingProcess13 and ManufacturingProcess36 are also highly ranked (71.64 and 68.82), suggesting that certain manufacturing processes are critical as well.

While both types of predictors, biological and process variables, are represented in the top rankings, process variables slightly dominate the list. The top predictor is a manufacturing process, and process variables account for three of the top five and six of the top ten predictors. Specific biological inputs, particularly BiologicalMaterial06 and BiologicalMaterial03, nonetheless remain influential.

This distribution suggests that, while biological factors contribute to yield, certain stages of the manufacturing process carry at least as much predictive signal, so refining those stages may be a promising route to improving yield.
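The class totals reported above sum over all predictors, and the full data set has far more process predictors than biological ones (45 versus 12 before filtering), so the larger class accumulates a larger total almost by construction. Counts and mean scores within the top ten give a complementary view. The sketch below uses the ten rows of the importance table above, copied by hand.

```r
# Top-ten importance scores copied from the varImp() table above
top10 <- data.frame(
  Name = c("ManufacturingProcess32", "BiologicalMaterial06",
           "BiologicalMaterial03", "ManufacturingProcess13",
           "ManufacturingProcess36", "BiologicalMaterial02",
           "ManufacturingProcess31", "ManufacturingProcess17",
           "ManufacturingProcess09", "BiologicalMaterial12"),
  Overall = c(100.00000, 90.00513, 73.45538, 71.63961, 68.82344,
              67.71626, 62.52098, 61.19694, 58.10035, 57.34811)
)
top10$VariableType <- ifelse(grepl("^Biological", top10$Name),
                             "Biological", "Manufacturing")

# Number of top-ten predictors in each class
class_counts <- table(top10$VariableType)
class_counts

# Mean importance score per class within the top ten
class_means <- tapply(top10$Overall, top10$VariableType, mean)
round(class_means, 2)
```

Comparing the per-class means with the counts helps separate how many predictors of each type rank highly from how strongly each one scores.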

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

topNpred <- predictorImportance$Name |> head(10)
topNpred <- append(topNpred,"Yield")
chermical_pivot <- chermical_comp[,topNpred] |> pivot_longer(cols = -Yield,names_to = "Variable",values_to = "Value")

ggplot( chermical_pivot, aes(x = Value, y = Yield)) +

  geom_point() +
  facet_wrap(~Variable, scales = "free_x",ncol = 3) +
  theme_minimal()

top10Data <- chermical_comp |> select(all_of(topNpred))
correlations <- cor(top10Data)
corrplot.mixed(correlations, tl.col = 'black', tl.pos = 'lt', 
         upper = "number", lower="circle")

This correlation matrix provides insights into the relationships among manufacturing processes, biological materials, and their potential influence on yield. Notably, ManufacturingProcess32, ManufacturingProcess09, and BiologicalMaterial02 exhibit moderate positive correlations with yield (r = 0.608, r = 0.503, and r = 0.482, respectively), suggesting that these factors may positively contribute to higher yield outcomes. Conversely, ManufacturingProcess13 and ManufacturingProcess36 show moderate negative correlations with yield (r = -0.504 and r = -0.516), indicating that increases in these variables may be associated with lower yields. These relationships highlight specific processes that could be adjusted or optimized to enhance yield.

The matrix also reveals patterns of association within the biological and manufacturing variables themselves. Biological materials, particularly BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial02, have strong positive correlations with one another (with correlations above r = 0.8). This suggests these materials may operate in tandem or respond similarly within the manufacturing environment, potentially due to shared biochemical properties or similar roles in the production process. Additionally, ManufacturingProcess13 and ManufacturingProcess09 exhibit a strong negative correlation (r = -0.791), indicating a possible trade-off or inverse relationship between these two processes, which could be a critical factor in production balancing.

Overall, this matrix suggests that yield is likely driven by a combination of both biological and manufacturing variables rather than a single dominant factor. While certain biological materials positively influence yield, some manufacturing processes appear to both support and inhibit it. These insights could guide further investigation into optimizing yield by balancing or adjusting specific manufacturing processes and exploring interactions between biological materials and production conditions.