Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations generates ten independent predictors, uniformly distributed on [0, 1], of which only the first five enter the following nonlinear equation:

\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon\), where \(\varepsilon \sim N(0, \sigma^2)\).

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

# Packages used in this exercise (startup messages suppressed for readability)
library(mlbench)   # mlbench.friedman1()
library(caret)     # train(), featurePlot(), postResample(), varImp()
library(earth)     # MARS models
library(kernlab)   # radial-basis SVMs
library(nnet)      # single-hidden-layer networks underlying avNNet()
library(dplyr)     # data manipulation
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
# convert x data from a matrix to a data frame
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)

# This creates a list with a vector 'y' and a matrix of predictors 'x'.
# Also simulate a large test set to estimate the true error rate with
# good precision:

testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
featurePlot(testData$x, testData$y)

knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Model Performance (k-Nearest Neighbors)

From the KNN tuning results:

- Best cross-validated RMSE (lowest): 3.183, at k = 17
- Final model selected: k = 17
- Test-set performance: RMSE = 3.204, R² = 0.682, MAE = 2.568

These results suggest that KNN with k = 17 gives the best performance among the KNN configurations tested, showing reasonable predictive accuracy on the nonlinear Friedman1 simulation.
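caret's plot method for a train object draws the resampled RMSE profile across k, which makes the flat region around k = 13 to 23 easy to see; a minimal sketch using the fitted object above:

plot(knnModel)  # cross-validated RMSE versus number of neighbors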

Averaged Neural Network (size = 5 hidden units)

library(nnet)
set.seed(421)

hidden_units <- 5  # nnet's size argument: units in the single hidden layer

avnnet_fit <- avNNet(
  trainingData$x,
  trainingData$y,
  decay = .01,
  size = hidden_units,
  repeats = 5,
  maxit = 500,
  bag = TRUE
)
## Warning: executing %dopar% sequentially: no parallel backend registered
## Fitting Repeat 1 
## 
## # weights:  61
## initial  value 45533.812204 
## iter  10 value 42651.755288
## iter  20 value 42550.319453
## iter  30 value 42548.534053
## iter  40 value 42548.422007
## iter  40 value 42548.421641
## iter  40 value 42548.421506
## final  value 42548.421506 
## converged
## Fitting Repeat 2 
## 
## # weights:  61
## initial  value 43344.694521 
## iter  10 value 40188.948530
## iter  20 value 40108.975272
## iter  30 value 40107.586044
## iter  40 value 40107.495096
## iter  40 value 40107.494937
## iter  40 value 40107.494884
## final  value 40107.494884 
## converged
## Fitting Repeat 3 
## 
## # weights:  61
## initial  value 45575.552973 
## iter  10 value 42240.911256
## iter  20 value 42178.065608
## iter  30 value 42176.176629
## iter  40 value 42176.052626
## final  value 42176.050418 
## converged
## Fitting Repeat 4 
## 
## # weights:  61
## initial  value 45002.076591 
## iter  10 value 41253.095282
## iter  20 value 41251.286533
## final  value 41251.245119 
## converged
## Fitting Repeat 5 
## 
## # weights:  61
## initial  value 42723.104381 
## iter  10 value 39743.427266
## iter  20 value 39731.736479
## iter  30 value 39731.636476
## iter  30 value 39731.636253
## iter  30 value 39731.636235
## final  value 39731.636235 
## converged
nnet_pred <- predict(
  avnnet_fit,
  newdata = testData$x
)

postResample(
  pred = nnet_pred,
  obs = testData$y
)
##       RMSE   Rsquared        MAE 
## 14.2769353  0.2814884 13.3869178
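The very large test RMSE (about 14.3, close to the mean of the response itself) is the signature of an output unit that cannot reach the response scale: nnet defaults to a logistic output (linout = FALSE), so every prediction is confined to (0, 1). A hedged re-fit with a linear output unit, otherwise identical (avnnet_fit_lin is a new name introduced here):

# linout = TRUE gives a linear output unit, required for regression
# targets that live outside (0, 1)
avnnet_fit_lin <- avNNet(
  trainingData$x, trainingData$y,
  decay = 0.01, size = 5, repeats = 5,
  maxit = 500, bag = TRUE, linout = TRUE
)
postResample(predict(avnnet_fit_lin, newdata = testData$x), testData$y)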

Multivariate Adaptive Regression Splines (MARS)

MARS is well suited to this problem: its hinge functions can capture the nonlinearities in the data (the sine and squared terms). Because the Friedman1 design makes X1–X5 informative and X6–X10 pure noise, the fitted terms and the plotmo plots below double as a check on variable selection, and they make a useful diagnostic for feature importance and interpretability in nonlinear models like MARS.

library(earth)
mars_fit <- earth(
  trainingData$x,
  trainingData$y
)

summary(mars_fit)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
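For reference, the GCV figure that earth reports is the standard MARS generalized cross-validation criterion, with enp the effective number of parameters (terms plus a per-knot penalty):

\[ \mathrm{GCV} = \frac{\mathrm{RSS}/N}{\left(1 - \mathrm{enp}/N\right)^{2}} \]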
plotmo(mars_fit)
##  plotmo grid:    X1        X2       X3        X4        X5        X6        X7
##           0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
##        X8        X9       X10
##  0.497961 0.5288716 0.5359218

mars_pred <- predict(
  mars_fit,
  newdata = testData$x
)

postResample(
  pred = mars_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 1.8136467 0.8677298 1.3911836
mars_iter_fit <- train(
  trainingData$x,
  trainingData$y,
  method = "earth",
  tuneGrid = expand.grid(.degree = 1:4, .nprune = 2:50),
  trControl = trainControl(method = "cv")
)

mars_iter_fit$finalModel
## Selected 15 of 18 terms, and 5 of 10 predictors (nprune=15)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 10 4
## GCV 1.618197    RSS 217.6151    GRSq 0.9343005    RSq 0.9553786
plot(varImp(mars_iter_fit))

mars_iter_pred <- predict(
  mars_iter_fit,
  newdata = testData$x
)

postResample(
  pred = mars_iter_pred,
  obs = testData$y
)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230

MARS performs exceptionally well, achieving the lowest test RMSE (1.159) and the highest test R² (0.946) among the Exercise 7.2 models. Importantly, the tuned model automatically identified the informative predictors, using only X1–X5 and leaving X6–X10 unused, which matches the true structure of the Friedman1 simulation.

MARS is the best-performing model and shows strong variable selection capabilities.

Neural networks may benefit from PCA on data sets with highly correlated predictors (such as Tecator); here, with independent predictors, the averaged network's poor showing is better explained by the missing linout = TRUE discussed above.

KNN offers a solid balance of simplicity and predictive power but may lag behind more flexible models like MARS.
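To make the Exercise 7.2 comparison concrete, the three test-set results computed above can be stacked into one matrix (a minimal sketch reusing objects already in the session):

rbind(
  KNN  = postResample(knnPred, testData$y),
  NNet = postResample(nnet_pred, testData$y),
  MARS = postResample(mars_iter_pred, testData$y)
)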

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
# Make this reproducible
set.seed(42)

knn_impute <- preProcess(chem, "knnImpute")  # imputes missing values; also centers and scales
df <- predict(knn_impute, ChemicalManufacturingProcess)

df <- df %>%
  select_at(vars(-one_of(nearZeroVar(., names = TRUE))))

in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]

pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)

pls_model
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 128, 129, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.8824790  0.3779221  0.6711462
##    2     1.1458456  0.4219806  0.7086431
##    3     0.7363066  0.5244517  0.5688553
##    4     0.8235294  0.5298005  0.5933120
##    5     0.9670735  0.4846010  0.6371199
##    6     0.9959036  0.4776684  0.6427478
##    7     0.9119517  0.4986338  0.6200233
##    8     0.9068621  0.5012144  0.6293371
##    9     0.8517370  0.5220166  0.6163795
##   10     0.8919356  0.5062912  0.6332243
##   11     0.9173758  0.4934557  0.6463164
##   12     0.9064125  0.4791526  0.6485663
##   13     0.9255289  0.4542181  0.6620193
##   14     1.0239913  0.4358371  0.6944056
##   15     1.0754710  0.4365214  0.7077991
##   16     1.1110579  0.4269065  0.7135684
##   17     1.1492855  0.4210485  0.7222868
##   18     1.1940639  0.4132534  0.7396357
##   19     1.2271867  0.4079005  0.7494818
##   20     1.2077102  0.4022859  0.7470327
##   21     1.2082648  0.4026711  0.7452969
##   22     1.2669285  0.3987044  0.7634170
##   23     1.3663033  0.3970188  0.7957514
##   24     1.4531634  0.3898475  0.8243034
##   25     1.5624265  0.3820102  0.8612935
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
pls_predictions <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "PLS")
results
##        RMSE  Rsquared       MAE Model
## 1 0.6192577 0.6771122 0.5059984   PLS
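One caveat in the call above: train() does not treat center = TRUE and scale = TRUE as pre-processing directives; they fall through ... to the underlying fit function, which is why the printout says "No pre-processing". The same applies to the KNN, SVM, and neural network calls below. It is mostly harmless here because knnImpute has already centered and scaled the data, but the caret idiom would be (a sketch; resampling results would differ slightly from the printout above):

pls_model <- train(
  Yield ~ ., data = train_df, method = "pls",
  preProc = c("center", "scale"),  # caret's supported pre-processing route
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)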

KNN Model

knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl("cv", number = 10),
  tuneLength = 25
)
knn_model
## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 130, 130, 130, 131, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7085987  0.4846711  0.5709345
##    7  0.7331379  0.4559107  0.5981516
##    9  0.7417741  0.4556845  0.6124459
##   11  0.7505969  0.4526780  0.6277377
##   13  0.7450412  0.4691830  0.6181457
##   15  0.7539451  0.4604423  0.6246762
##   17  0.7663763  0.4474991  0.6306860
##   19  0.7694326  0.4480450  0.6315047
##   21  0.7706841  0.4508441  0.6288689
##   23  0.7780866  0.4419184  0.6350889
##   25  0.7834333  0.4320541  0.6382165
##   27  0.7877890  0.4339581  0.6454522
##   29  0.7981302  0.4187796  0.6533767
##   31  0.8060716  0.3977540  0.6610186
##   33  0.8063566  0.4128701  0.6600828
##   35  0.8144040  0.3964037  0.6671734
##   37  0.8160580  0.4009313  0.6662999
##   39  0.8220176  0.3884786  0.6693883
##   41  0.8274778  0.3750873  0.6746765
##   43  0.8288000  0.3820179  0.6758220
##   45  0.8285016  0.3878549  0.6747216
##   47  0.8331504  0.3808342  0.6774480
##   49  0.8368096  0.3816669  0.6829726
##   51  0.8391074  0.3804316  0.6857959
##   53  0.8442763  0.3704957  0.6915276
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knn_predictions <- predict(knn_model, test_df)

results <- data.frame(t(postResample(pred = knn_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "KNN") %>% rbind(results)
print(results)
##        RMSE  Rsquared       MAE Model
## 1 0.7149819 0.5894607 0.5047973   KNN
## 2 0.6192577 0.6771122 0.5059984   PLS
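Note that k = 5 sits at the lower edge of caret's default grid for knn, so the optimum may lie below it; a hedged re-tune probing smaller neighborhoods:

knn_model <- train(
  Yield ~ ., data = train_df, method = "knn",
  preProc = c("center", "scale"),
  tuneGrid = data.frame(k = 1:15),  # extend the search below k = 5
  trControl = trainControl("cv", number = 10)
)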

MARS Model

MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)

MARS_model <- train(
  Yield ~ ., data = train_df, method = "earth",
  tuneGrid = MARS_grid,
  # center/scale are not earth() arguments, so uncommenting them makes
  # train() error; preProc = c("center", "scale") is the caret route.
  # tuneLength is omitted because it is ignored when tuneGrid is supplied.
  trControl = trainControl("cv", number = 10)
)
MARS_model
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 128, 130, 128, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7744594  0.4159942  0.6084363
##   1        3      0.7003597  0.5682276  0.5530374
##   1        4      0.6469215  0.6086750  0.5154043
##   1        5      0.6404520  0.6073071  0.5042864
##   1        6      0.6960964  0.5396157  0.5588565
##   1        7      0.6997921  0.5521698  0.5669813
##   1        8      0.7147341  0.5423472  0.5621022
##   1        9      0.7173559  0.5508650  0.5641610
##   1       10      0.7487512  0.5325936  0.5869088
##   1       11      0.7415722  0.5353676  0.5847493
##   1       12      0.7353149  0.5519142  0.5768069
##   1       13      0.7450704  0.5357967  0.5887685
##   1       14      0.7419274  0.5400866  0.5844385
##   1       15      0.7262459  0.5553445  0.5775694
##   2        2      0.7814439  0.4052275  0.6150579
##   2        3      0.6961165  0.5526862  0.5636101
##   2        4      0.6983163  0.5544842  0.5575337
##   2        5      0.6730604  0.5897508  0.5262153
##   2        6      0.6707267  0.5848137  0.5147009
##   2        7      0.6592644  0.5959026  0.5199324
##   2        8      0.6170179  0.6400977  0.4808173
##   2        9      0.5848818  0.6718214  0.4591934
##   2       10      0.5844613  0.6757433  0.4556203
##   2       11      0.5868888  0.6684397  0.4622055
##   2       12      0.5804160  0.6762717  0.4550110
##   2       13      0.5822871  0.6910726  0.4475039
##   2       14      0.5914308  0.6855774  0.4499391
##   2       15      0.5796279  0.6970556  0.4450954
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
MARS_predictions <- predict(MARS_model, test_df)

results <- data.frame(t(postResample(pred = MARS_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "MARS") %>% rbind(results)
print(results)
##        RMSE  Rsquared       MAE Model
## 1 0.7694337 0.5378346 0.5809592  MARS
## 2 0.7149819 0.5894607 0.5047973   KNN
## 3 0.6192577 0.6771122 0.5059984   PLS
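To see which hinge functions and two-way interactions the selected degree-2, 15-term model actually uses, earth's summary of the final model can be printed (a sketch; output omitted here):

summary(MARS_model$finalModel)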

SVM Model

SVM_model <- train(
  Yield ~ ., data = train_df, method = "svmRadial",
  center = TRUE,
  scale = TRUE,
  trControl = trainControl(method = "cv"),
  tuneLength = 25
)
SVM_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   C           RMSE       Rsquared   MAE      
##         0.25  0.7683015  0.4646282  0.6392703
##         0.50  0.7174781  0.4848381  0.5965095
##         1.00  0.6686679  0.5330949  0.5530530
##         2.00  0.6409061  0.5552794  0.5275947
##         4.00  0.6356640  0.5619233  0.5166512
##         8.00  0.6286151  0.5761807  0.5115837
##        16.00  0.6287131  0.5760690  0.5116965
##        32.00  0.6287131  0.5760690  0.5116965
##        64.00  0.6287131  0.5760690  0.5116965
##       128.00  0.6287131  0.5760690  0.5116965
##       256.00  0.6287131  0.5760690  0.5116965
##       512.00  0.6287131  0.5760690  0.5116965
##      1024.00  0.6287131  0.5760690  0.5116965
##      2048.00  0.6287131  0.5760690  0.5116965
##      4096.00  0.6287131  0.5760690  0.5116965
##      8192.00  0.6287131  0.5760690  0.5116965
##     16384.00  0.6287131  0.5760690  0.5116965
##     32768.00  0.6287131  0.5760690  0.5116965
##     65536.00  0.6287131  0.5760690  0.5116965
##    131072.00  0.6287131  0.5760690  0.5116965
##    262144.00  0.6287131  0.5760690  0.5116965
##    524288.00  0.6287131  0.5760690  0.5116965
##   1048576.00  0.6287131  0.5760690  0.5116965
##   2097152.00  0.6287131  0.5760690  0.5116965
##   4194304.00  0.6287131  0.5760690  0.5116965
## 
## Tuning parameter 'sigma' was held constant at a value of 0.015148
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.015148 and C = 8.
SVM_predictions <- predict(SVM_model, test_df)

results <- data.frame(t(postResample(pred = SVM_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "SVM") %>% rbind(results)

Neural Network Model

nnet_grid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
nnet_maxnwts <- 5 * ncol(train_df) + 5 + 1
nnet_model <- train(
  Yield ~ ., data = train_df, method = "avNNet",
  center = TRUE,
  scale = TRUE,
  tuneGrid = nnet_grid,
  trControl = trainControl(method = "cv"),
  linout = TRUE,
  trace = FALSE,
  MaxNWts = nnet_maxnwts,
  maxit = 500
)

nnet_model
## Model Averaged Neural Network 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 131, 129, 130, 128, 130, 129, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    0.7910248  0.4228412  0.6130383
##   0.00    2    0.9218862  0.4080614  0.7428019
##   0.00    3    0.7740539  0.5185395  0.6172976
##   0.00    4    0.7024795  0.5648305  0.5597466
##   0.00    5    0.6390399  0.6194867  0.5144565
##   0.00    6          NaN        NaN        NaN
##   0.00    7          NaN        NaN        NaN
##   0.00    8          NaN        NaN        NaN
##   0.00    9          NaN        NaN        NaN
##   0.00   10          NaN        NaN        NaN
##   0.01    1    0.7696626  0.4978424  0.6083146
##   0.01    2    0.7584432  0.5120800  0.6254292
##   0.01    3    0.7995990  0.5296820  0.6455941
##   0.01    4    0.6883856  0.5995827  0.5517506
##   0.01    5    0.6843272  0.5975191  0.5467633
##   0.01    6          NaN        NaN        NaN
##   0.01    7          NaN        NaN        NaN
##   0.01    8          NaN        NaN        NaN
##   0.01    9          NaN        NaN        NaN
##   0.01   10          NaN        NaN        NaN
##   0.10    1    0.7121318  0.5409956  0.5882917
##   0.10    2    0.6924135  0.5545770  0.5497616
##   0.10    3    0.6739546  0.6037860  0.5412176
##   0.10    4    0.6559497  0.6294231  0.5376020
##   0.10    5    0.6453881  0.6413499  0.5201350
##   0.10    6          NaN        NaN        NaN
##   0.10    7          NaN        NaN        NaN
##   0.10    8          NaN        NaN        NaN
##   0.10    9          NaN        NaN        NaN
##   0.10   10          NaN        NaN        NaN
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0 and bag = FALSE.
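The NaN rows for size >= 6 follow from the MaxNWts cap computed above: a single-hidden-layer, single-output network with p = 56 predictors has size * (p + 1) + size + 1 weights, so nnet_maxnwts = 5 * 57 + 5 + 1 = 291 admits size = 5 exactly and rejects anything larger. A hedged fix sized for the largest value in the grid:

p <- ncol(train_df) - 1                 # 56 predictors; Yield is the response
nnet_maxnwts <- 10 * (p + 1) + 10 + 1   # enough weights for size = 10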

Based on the test-set metrics collected below, the partial least squares (PLS) model demonstrated the best overall performance among the models compared. It achieved the lowest RMSE (0.6193), indicating the smallest average prediction error, and the highest R² (0.6771), meaning it explained the most variance in the response; its MAE (0.5060) was also among the lowest. The support vector machine was the strongest of the nonlinear models: it posted the lowest MAE overall (0.4926), but its higher RMSE and lower R² keep it slightly behind PLS. The MARS model showed the weakest test performance here, with the highest RMSE and the lowest R². Overall, PLS stands out as the most effective choice on these metrics, with SVM the best of the nonlinear alternatives.

library(kableExtra)
nnet_predictions <- predict(nnet_model, test_df)

results <- data.frame(t(postResample(pred = nnet_predictions, obs = test_df$Yield))) %>%
  mutate("Model"= "Neural Network") %>% rbind(results)
results %>% unique() %>% kable() %>% kable_styling()
RMSE       Rsquared   MAE        Model
0.7197787  0.5648347  0.5760858  Neural Network
0.6815833  0.6251941  0.4926105  SVM
0.7694337  0.5378346  0.5809592  MARS
0.7149819  0.5894607  0.5047973  KNN
0.6192577  0.6771122  0.5059984  PLS

The SVM model relies most heavily on ManufacturingProcess32, but a combination of process- and material-related features drives its predictive performance. This kind of output is useful for feature selection, interpretability, and guiding domain-specific investigation of critical quality factors.

The consistently high ranking of ManufacturingProcess32, 13, 09, and 36 suggests they are key drivers of the target variable. The presence of BiologicalMaterial features among the top predictors of the PLS model indicates that the interplay between material properties and process controls is critical when modeling yield.

varImp(SVM_model)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   93.82
## ManufacturingProcess09   89.93
## ManufacturingProcess17   88.20
## BiologicalMaterial06     82.61
## BiologicalMaterial03     79.44
## ManufacturingProcess36   73.85
## BiologicalMaterial12     72.36
## ManufacturingProcess06   69.00
## ManufacturingProcess11   62.34
## ManufacturingProcess31   56.39
## BiologicalMaterial02     50.34
## BiologicalMaterial11     48.53
## BiologicalMaterial09     44.76
## ManufacturingProcess30   41.87
## BiologicalMaterial08     40.24
## ManufacturingProcess29   38.54
## ManufacturingProcess33   38.16
## BiologicalMaterial04     36.92
## ManufacturingProcess25   36.83
varImp(pls_model)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   88.04
## ManufacturingProcess36   82.20
## ManufacturingProcess13   82.11
## ManufacturingProcess17   80.25
## ManufacturingProcess06   59.06
## ManufacturingProcess11   55.93
## BiologicalMaterial02     55.46
## BiologicalMaterial06     54.64
## BiologicalMaterial03     54.50
## ManufacturingProcess33   53.91
## ManufacturingProcess12   52.04
## BiologicalMaterial08     49.76
## BiologicalMaterial12     47.40
## ManufacturingProcess34   45.47
## BiologicalMaterial11     45.05
## BiologicalMaterial01     44.18
## BiologicalMaterial04     42.95
## ManufacturingProcess04   39.94
## ManufacturingProcess28   36.61

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
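One quick check before plotting is to isolate the predictors unique to the nonlinear model by comparing the SVM top 10 against the PLS top 10 (a sketch using the varImp objects shown above):

svm_imp <- varImp(SVM_model)$importance
pls_imp <- varImp(pls_model)$importance
svm_top <- rownames(svm_imp)[order(-svm_imp$Overall)][1:10]
pls_top <- rownames(pls_imp)[order(-pls_imp$Overall)][1:10]
setdiff(svm_top, pls_top)  # in the SVM top 10 but not the PLS top 10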

top_predictors <- varImp(SVM_model)$importance |>
  arrange(desc(Overall)) |>
  head(10)
SVM_predictions <- predict(SVM_model, test_df)

# Replace the observed Yield with the SVM predictions so the plots show
# the model's fitted relationship with each predictor
relationship_df <- test_df |> mutate(Yield = SVM_predictions)


for (predictor in rownames(top_predictors)) {
  # .data[[predictor]] is the tidy-eval replacement for the deprecated
  # aes_string() interface
  plot <- ggplot(relationship_df, aes(x = .data[[predictor]], y = Yield)) +
    geom_point(color = "blue", alpha = 0.6) +
    labs(
      title = paste("Relationship between", predictor, "and Yield"),
      x = predictor,
      y = "Yield"
    ) +
    theme_minimal()

  print(plot)
}

Yes, these scatter plots reveal meaningful intuition about how the key biological and process variables relate to yield; because the y-axis shows the SVM predictions, the trends describe the relationships the model has learned. Several features, particularly ManufacturingProcess32, BiologicalMaterial06, and ManufacturingProcess13, show clear linear or nonlinear trends with predicted yield. For example, ManufacturingProcess32 has a strong positive relationship with yield, consistent with its top importance score in both models. Similarly, BiologicalMaterial06 and BiologicalMaterial03 show upward trends, suggesting that higher values of these material attributes accompany higher yield. By contrast, variables such as ManufacturingProcess36 and ManufacturingProcess06 show more scattered relationships, indicating weaker or more complex interactions with the target. Overall, the plots agree with the importance rankings and highlight that certain process and biological factors are strongly associated with yield, offering both predictive power and domain insight.