Problem 7.2

Question

Friedman (1991) introduced several benchmark data sets created by simulation.
One of these simulations used the following nonlinear equation:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

Tune several models on these data.
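To make the equation concrete, here is a minimal base R sketch of the simulation. It assumes ten predictors drawn uniformly on [0, 1], of which only the first five enter the response; the function name simulate_friedman1 is a hypothetical stand-in for illustration, not the mlbench implementation.

```r
# Hypothetical helper (not part of mlbench): simulate the Friedman benchmark.
# Assumes 10 uniform(0, 1) predictors; only columns 1-5 affect the response.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), nrow = n, ncol = 10)
  # Noise-free part of the equation
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with standard deviation sd
  list(x = x, y = signal + rnorm(n, sd = sd), signal = signal)
}

set.seed(1)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)        # 200 rows, 10 columns
range(sim$signal) # the noise-free signal lies between 0 and 30
```

Columns 6 through 10 never appear in the formula, which is why they act as pure noise predictors in the models below.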

library(mlbench)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

featurePlot(trainingData$x, trainingData$y)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# KNN
set.seed(200)
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10,
  trControl = trainControl(method = "cv")
)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.238598  0.5836232  2.705822
##    7  3.117335  0.6295372  2.561052
##    9  3.100423  0.6590940  2.524483
##   11  3.086639  0.6822198  2.506584
##   13  3.094904  0.6902613  2.504433
##   15  3.116059  0.7045172  2.516131
##   17  3.129874  0.7133067  2.529370
##   19  3.151840  0.7183283  2.546422
##   21  3.175787  0.7209301  2.574113
##   23  3.208213  0.7146199  2.611285
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.

# MARS
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv")
)

## Loading required package: earth

## Loading required package: Formula

## Loading required package: plotmo

## Loading required package: plotrix

marsModel

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.188280  0.3042527  3.460689
##   1        3      3.551182  0.4999832  2.837116
##   1        4      2.653143  0.7167280  2.128222
##   1        5      2.405769  0.7562160  1.948161
##   1        6      2.295006  0.7754603  1.853199
##   1        7      1.771950  0.8611767  1.391357
##   1        8      1.647182  0.8774867  1.299564
##   1        9      1.609816  0.8837307  1.299705
##   1       10      1.635035  0.8798236  1.309436
##   1       11      1.571915  0.8896147  1.260711
##   1       12      1.571561  0.8898750  1.253077
##   1       13      1.567577  0.8906927  1.250795
##   1       14      1.571673  0.8909652  1.245508
##   1       15      1.571673  0.8909652  1.245508
##   1       16      1.571673  0.8909652  1.245508
##   1       17      1.571673  0.8909652  1.245508
##   1       18      1.571673  0.8909652  1.245508
##   1       19      1.571673  0.8909652  1.245508
##   1       20      1.571673  0.8909652  1.245508
##   1       21      1.571673  0.8909652  1.245508
##   1       22      1.571673  0.8909652  1.245508
##   1       23      1.571673  0.8909652  1.245508
##   1       24      1.571673  0.8909652  1.245508
##   1       25      1.571673  0.8909652  1.245508
##   1       26      1.571673  0.8909652  1.245508
##   1       27      1.571673  0.8909652  1.245508
##   1       28      1.571673  0.8909652  1.245508
##   1       29      1.571673  0.8909652  1.245508
##   1       30      1.571673  0.8909652  1.245508
##   1       31      1.571673  0.8909652  1.245508
##   1       32      1.571673  0.8909652  1.245508
##   1       33      1.571673  0.8909652  1.245508
##   1       34      1.571673  0.8909652  1.245508
##   1       35      1.571673  0.8909652  1.245508
##   1       36      1.571673  0.8909652  1.245508
##   1       37      1.571673  0.8909652  1.245508
##   1       38      1.571673  0.8909652  1.245508
##   2        2      4.188280  0.3042527  3.460689
##   2        3      3.551182  0.4999832  2.837116
##   2        4      2.615256  0.7216809  2.128763
##   2        5      2.344223  0.7683855  1.890080
##   2        6      2.275048  0.7762472  1.807779
##   2        7      1.841464  0.8418935  1.457945
##   2        8      1.641647  0.8839822  1.288520
##   2        9      1.535119  0.9002991  1.214772
##   2       10      1.473254  0.9101555  1.158761
##   2       11      1.379476  0.9207735  1.080991
##   2       12      1.285380  0.9283193  1.033426
##   2       13      1.267261  0.9328905  1.014726
##   2       14      1.261797  0.9327541  1.009821
##   2       15      1.266663  0.9320714  1.005751
##   2       16      1.270858  0.9322465  1.009757
##   2       17      1.263778  0.9327687  1.007653
##   2       18      1.263778  0.9327687  1.007653
##   2       19      1.263778  0.9327687  1.007653
##   2       20      1.263778  0.9327687  1.007653
##   2       21      1.263778  0.9327687  1.007653
##   2       22      1.263778  0.9327687  1.007653
##   2       23      1.263778  0.9327687  1.007653
##   2       24      1.263778  0.9327687  1.007653
##   2       25      1.263778  0.9327687  1.007653
##   2       26      1.263778  0.9327687  1.007653
##   2       27      1.263778  0.9327687  1.007653
##   2       28      1.263778  0.9327687  1.007653
##   2       29      1.263778  0.9327687  1.007653
##   2       30      1.263778  0.9327687  1.007653
##   2       31      1.263778  0.9327687  1.007653
##   2       32      1.263778  0.9327687  1.007653
##   2       33      1.263778  0.9327687  1.007653
##   2       34      1.263778  0.9327687  1.007653
##   2       35      1.263778  0.9327687  1.007653
##   2       36      1.263778  0.9327687  1.007653
##   2       37      1.263778  0.9327687  1.007653
##   2       38      1.263778  0.9327687  1.007653
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

# SVM Radial
set.seed(200)
svmModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "svmRadial",
  preProc = c("center", "scale"),
  tuneLength = 14,
  trControl = trainControl(method = "cv")
)
svmModel

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.525164  0.7810576  2.010680
##      0.50  2.270567  0.7944850  1.794902
##      1.00  2.099319  0.8155594  1.659342
##      2.00  2.005858  0.8302852  1.578799
##      4.00  1.934650  0.8435677  1.528373
##      8.00  1.915653  0.8475592  1.528614
##     16.00  1.923884  0.8463090  1.535976
##     32.00  1.923884  0.8463090  1.535976
##     64.00  1.923884  0.8463090  1.535976
##    128.00  1.923884  0.8463090  1.535976
##    256.00  1.923884  0.8463090  1.535976
##    512.00  1.923884  0.8463090  1.535976
##   1024.00  1.923884  0.8463090  1.535976
##   2048.00  1.923884  0.8463090  1.535976
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 8.

# Neural Network
set.seed(200)
nnetGrid <- expand.grid(
  .decay = c(0, 0.01, 0.1),
  .size  = c(1:10),
  .bag   = FALSE
)
nnetModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "avNNet",
  tuneGrid = nnetGrid,
  preProc = c("center", "scale"),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  trControl = trainControl(method = "cv")
)

## Warning: executing %dopar% sequentially: no parallel backend registered

nnetModel

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.399955  0.7641657  1.892591
##   0.00    2    2.422496  0.7597612  1.940053
##   0.00    3    2.048209  0.8173992  1.637855
##   0.00    4    1.942073  0.8365333  1.554821
##   0.00    5    2.269670  0.7944413  1.738737
##   0.00    6    3.145864  0.7121045  2.242803
##   0.00    7    4.255063  0.4944622  2.730696
##   0.00    8    5.087452  0.5248100  2.898824
##   0.00    9    4.852871  0.5247125  2.790581
##   0.00   10    3.634054  0.6323073  2.472412
##   0.01    1    2.380902  0.7641801  1.871310
##   0.01    2    2.456920  0.7487966  1.925584
##   0.01    3    2.152617  0.8037267  1.690709
##   0.01    4    1.926277  0.8453343  1.547265
##   0.01    5    2.143562  0.8074224  1.717004
##   0.01    6    2.140588  0.8081466  1.696456
##   0.01    7    2.379436  0.7716195  1.850761
##   0.01    8    2.344241  0.7796374  1.845464
##   0.01    9    2.287667  0.7808732  1.739273
##   0.01   10    2.306110  0.7895454  1.833795
##   0.10    1    2.392295  0.7614560  1.873844
##   0.10    2    2.437044  0.7557121  1.918838
##   0.10    3    2.136581  0.8043180  1.702667
##   0.10    4    2.009700  0.8245206  1.574397
##   0.10    5    2.015255  0.8346002  1.586707
##   0.10    6    2.030267  0.8295288  1.586079
##   0.10    7    2.129594  0.8106161  1.699485
##   0.10    8    2.205965  0.8012470  1.719758
##   0.10    9    2.230354  0.8005457  1.736719
##   0.10   10    2.371577  0.7699950  1.866457
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.

# Compare via resampling
results72 <- resamples(list(
  KNN       = knnModel,
  MARS      = marsModel,
  SVM       = svmModel,
  NeuralNet = nnetModel
))
summary(results72)

## 
## Call:
## summary.resamples(object = results72)
## 
## Models: KNN, MARS, SVM, NeuralNet 
## Number of resamples: 10 
## 
## MAE 
##                Min.  1st Qu.    Median     Mean  3rd Qu.     Max. NA's
## KNN       1.9925181 2.203288 2.3861486 2.506584 2.768301 3.319190    0
## MARS      0.7326193 0.813534 0.9653755 1.009821 1.209518 1.395006    0
## SVM       1.2043965 1.425654 1.5320210 1.528614 1.564950 1.878901    0
## NeuralNet 1.0007594 1.431442 1.5112167 1.547265 1.687422 1.987858    0
## 
## RMSE 
##                Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## KNN       2.5824845 2.797964 3.039929 3.086639 3.292676 3.858210    0
## MARS      0.8900136 1.026360 1.253189 1.261797 1.430436 1.691489    0
## SVM       1.4519357 1.834071 1.918422 1.915653 2.026748 2.362976    0
## NeuralNet 1.2308634 1.791444 1.838269 1.926277 2.088252 2.670345    0
## 
## Rsquared 
##                Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## KNN       0.4810928 0.6450407 0.6663713 0.6822198 0.7565689 0.8154310    0
## MARS      0.8759205 0.9008865 0.9424540 0.9327541 0.9642843 0.9832637    0
## SVM       0.7329985 0.8369792 0.8625555 0.8475592 0.8685317 0.9254094    0
## NeuralNet 0.6430600 0.8050521 0.8833792 0.8453343 0.8963460 0.9313138    0

# Test set performance
postResample(pred = predict(knnModel,  newdata = testData$x), obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.1222641 0.6690472 2.4963650

postResample(pred = predict(marsModel, newdata = testData$x), obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923

postResample(pred = predict(svmModel,  newdata = testData$x), obs = testData$y)

##      RMSE  Rsquared       MAE 
## 2.0541197 0.8290353 1.5586411

postResample(pred = predict(nnetModel, newdata = testData$x), obs = testData$y)

##      RMSE  Rsquared       MAE 
## 2.0603975 0.8320657 1.5289921

# Does MARS select informative predictors X1-X5?
varImp(marsModel)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.74
## X5   15.53
## X3    0.00

Final Answer

The Friedman (1991) simulated dataset was used to compare several nonlinear regression models, including K-Nearest Neighbors (KNN), MARS, Support Vector Machines (SVM), and Neural Networks. From the feature plot, it looked like predictors X1–X5 had stronger relationships with the response, while X6–X10 looked more random and did not appear to contribute much.

Based on the cross-validation results, MARS performed best overall, with the lowest RMSE (1.2618), the highest R-squared (0.9328), and the lowest MAE (1.0098) of the four models. The selected MARS model used degree = 2 and nprune = 14. The second-degree terms let it model interactions such as the one between x1 and x2 in the Friedman equation, which a purely additive (degree = 1) model cannot represent.
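The ranking can be reproduced from the mean cross-validated metrics reported by summary(results72). The sketch below simply copies those means into a data frame and sorts by RMSE; the object names cv_summary and cv_ranked are introduced here for illustration only.

```r
# Mean 10-fold CV metrics copied from the summary(results72) output
cv_summary <- data.frame(
  model    = c("KNN", "MARS", "SVM", "NeuralNet"),
  RMSE     = c(3.086639, 1.261797, 1.915653, 1.926277),
  Rsquared = c(0.6822198, 0.9327541, 0.8475592, 0.8453343),
  MAE      = c(2.506584, 1.009821, 1.528614, 1.547265),
  stringsAsFactors = FALSE
)

# Sort so that the model with the lowest mean RMSE comes first
cv_ranked <- cv_summary[order(cv_summary$RMSE), ]
best_cv_model <- cv_ranked$model[1]

print(cv_ranked, row.names = FALSE)
best_cv_model  # "MARS"
```

SVM and the neural network are nearly tied on mean RMSE, so the ordering between those two should not be over-interpreted.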

The test set results also confirmed that MARS was the strongest model. Its test RMSE was 1.1723 with an R-squared of 0.9449, which was much better than KNN (RMSE = 3.1223), SVM (RMSE = 2.0541), and Neural Networks (RMSE = 2.0604).
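For reference, the three numbers that postResample() reports for a regression model can be computed by hand. The sketch below is a base R approximation under the assumption that R-squared is the squared correlation between predictions and observations; regression_metrics is a hypothetical helper name, and the toy vectors are made up.

```r
# Hypothetical helper mirroring the three regression metrics shown above
regression_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation (assumed definition)
    MAE      = mean(abs(pred - obs))        # mean absolute error
  )
}

# Toy data, for illustration only
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)
m <- regression_metrics(pred, obs)
m
# RMSE is sqrt(0.125), about 0.354, and MAE is 0.25 for these toy values
```

Because RMSE squares the errors before averaging, it penalizes a few large misses more heavily than MAE does, which is one reason the two metrics can rank models differently.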

For the question about whether MARS selected the informative predictors, the variable importance results show that it did. The importance list contains X1, X4, X2, X5, and X3, in that order, which are exactly the five predictors in the Friedman equation, and none of the noise variables X6–X10 appear in it. X3 is listed last with a scaled importance of 0.00. Because the scores are rescaled so that the highest-ranked variable is 100 and the lowest-ranked listed variable is 0, this marks X3 as the weakest of the five rather than showing that it has no effect.
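The 0.00 reported for X3 is easier to read with the rescaling written out. The sketch below assumes a min-max rescaling to the 0 to 100 range; the raw importance values are hypothetical numbers chosen only to illustrate the arithmetic, and rescale_importance is not a caret function.

```r
# Hypothetical helper: min-max rescaling of importance scores to 0-100
rescale_importance <- function(raw) {
  shifted <- raw - min(raw)       # lowest score becomes 0
  100 * shifted / max(shifted)    # highest score becomes 100
}

# Made-up raw scores, for illustration only (not taken from the fitted model)
raw_importance <- c(X1 = 40, X4 = 32, X2 = 24, X5 = 12, X3 = 8)
scaled <- rescale_importance(raw_importance)
scaled
# X1 = 100, X4 = 75, X2 = 50, X5 = 12.5, X3 = 0
```

Even though the hypothetical raw score for X3 is 8, its rescaled value is 0, so a scaled 0.00 only says that a variable ranks last among those listed.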

Overall, MARS gave the best performance and also did the best job of identifying the true predictors driving the response.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)

x_raw <- ChemicalManufacturingProcess[, -1]
y_raw <- ChemicalManufacturingProcess[, 1]

# Impute missing values
imputer   <- preProcess(x_raw, method = "knnImpute")
x_imputed <- predict(imputer, x_raw)

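# Remove near-zero variance predictors.
# Guard against an empty index: x[, -integer(0)] would drop every column.
nzv     <- nearZeroVar(x_imputed)
x_clean <- if (length(nzv) > 0) x_imputed[, -nzv, drop = FALSE] else x_imputed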

# Train/test split
set.seed(200)
trainIndex <- createDataPartition(y_raw, p = 0.8, list = FALSE)

x_train <- x_clean[trainIndex, ]
x_test  <- x_clean[-trainIndex, ]
y_train <- y_raw[trainIndex]
y_test  <- y_raw[-trainIndex]

# Neural Network
set.seed(200)
nnetGrid <- expand.grid(
  .decay = c(0, 0.01, 0.1),
  .size  = c(1:10),
  .bag   = FALSE
)
nnet_model <- train(
  x = x_train,
  y = y_train,
  method = "avNNet",
  tuneGrid = nnetGrid,
  preProc = c("center", "scale"),
  linout = TRUE,
  trace = FALSE,
  maxit = 500,
  trControl = trainControl(method = "cv")
)
nnet_model

## Model Averaged Neural Network 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.523363  0.3604747  1.234126
##   0.00    2    1.520574  0.3503214  1.209887
##   0.00    3    1.764840  0.3162132  1.440432
##   0.00    4    2.103388  0.2918259  1.658930
##   0.00    5    2.023664  0.3056474  1.504181
##   0.00    6    2.105432  0.2747294  1.667130
##   0.00    7    2.751783  0.2608817  2.096610
##   0.00    8    4.014804  0.1738079  3.004669
##   0.00    9    5.360347  0.2653903  3.734766
##   0.00   10    7.660111  0.1505424  5.132431
##   0.01    1    1.544400  0.3987394  1.214985
##   0.01    2    1.479339  0.4718284  1.207249
##   0.01    3    1.419732  0.5061298  1.161174
##   0.01    4    1.567317  0.4243090  1.270015
##   0.01    5    1.851476  0.3654024  1.376861
##   0.01    6    1.520343  0.4686866  1.180328
##   0.01    7    1.456154  0.4675767  1.201409
##   0.01    8    1.709202  0.4808853  1.367491
##   0.01    9    2.310405  0.3996200  1.717039
##   0.01   10    2.385714  0.2537867  1.775583
##   0.10    1    1.469752  0.4537444  1.177949
##   0.10    2    1.655718  0.4423940  1.273301
##   0.10    3    1.592602  0.4914365  1.251160
##   0.10    4    1.881032  0.3930018  1.336763
##   0.10    5    1.724776  0.4863146  1.266150
##   0.10    6    1.729358  0.4206057  1.264146
##   0.10    7    1.684171  0.4585358  1.252075
##   0.10    8    1.912418  0.3658850  1.377345
##   0.10    9    1.619666  0.4384281  1.219826
##   0.10   10    1.512825  0.4127407  1.239338
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.01 and bag = FALSE.

# MARS
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars_model <- train(
  x = x_train,
  y = y_train,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv")
)
mars_model

## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.358846  0.4736659  1.0773044
##   1        3      1.221749  0.5615016  0.9872372
##   1        4      1.237128  0.5553756  0.9997244
##   1        5      1.266570  0.5295067  1.0275782
##   1        6      1.284562  0.5203949  1.0342975
##   1        7      1.273451  0.5390375  1.0041517
##   1        8      1.255157  0.5477567  0.9723961
##   1        9      1.260950  0.5322269  0.9852424
##   1       10      1.232531  0.5560667  0.9876363
##   1       11      1.238162  0.5487083  0.9746232
##   1       12      1.245077  0.5463104  0.9696214
##   1       13      1.243699  0.5490848  0.9714423
##   1       14      1.269269  0.5379033  0.9808639
##   1       15      1.268386  0.5369130  0.9779810
##   1       16      1.261876  0.5402092  0.9747751
##   1       17      1.262785  0.5392306  0.9771557
##   1       18      1.262785  0.5392306  0.9771557
##   1       19      1.262785  0.5392306  0.9771557
##   1       20      1.262785  0.5392306  0.9771557
##   1       21      1.262785  0.5392306  0.9771557
##   1       22      1.262785  0.5392306  0.9771557
##   1       23      1.262785  0.5392306  0.9771557
##   1       24      1.262785  0.5392306  0.9771557
##   1       25      1.262785  0.5392306  0.9771557
##   1       26      1.262785  0.5392306  0.9771557
##   1       27      1.262785  0.5392306  0.9771557
##   1       28      1.262785  0.5392306  0.9771557
##   1       29      1.262785  0.5392306  0.9771557
##   1       30      1.262785  0.5392306  0.9771557
##   1       31      1.262785  0.5392306  0.9771557
##   1       32      1.262785  0.5392306  0.9771557
##   1       33      1.262785  0.5392306  0.9771557
##   1       34      1.262785  0.5392306  0.9771557
##   1       35      1.262785  0.5392306  0.9771557
##   1       36      1.262785  0.5392306  0.9771557
##   1       37      1.262785  0.5392306  0.9771557
##   1       38      1.262785  0.5392306  0.9771557
##   2        2      1.358846  0.4736659  1.0773044
##   2        3      1.237063  0.5681071  1.0074004
##   2        4      1.222109  0.5758947  0.9771310
##   2        5      1.245134  0.5552182  0.9875915
##   2        6      1.364939  0.4650395  1.0745256
##   2        7      1.398639  0.4538580  1.0940460
##   2        8      1.407908  0.4489776  1.0994214
##   2        9      1.405588  0.4509498  1.0950718
##   2       10      1.369356  0.4729875  1.0481172
##   2       11      1.397721  0.4583520  1.0796357
##   2       12      1.378682  0.4749164  1.0511213
##   2       13      1.359249  0.4821028  1.0296717
##   2       14      1.345667  0.4941844  1.0238402
##   2       15      1.329379  0.5129324  1.0005878
##   2       16      1.358357  0.4971061  1.0225444
##   2       17      1.334597  0.5107436  0.9910424
##   2       18      1.314057  0.5179948  0.9768382
##   2       19      1.353755  0.5111639  1.0136810
##   2       20      1.322935  0.5269341  0.9912407
##   2       21      1.328934  0.5278357  1.0050881
##   2       22      1.332035  0.5270839  1.0045313
##   2       23      1.335219  0.5254110  1.0136223
##   2       24      1.332371  0.5267988  1.0114858
##   2       25      1.327000  0.5320395  1.0073512
##   2       26      1.327171  0.5343851  1.0107119
##   2       27      1.327171  0.5343851  1.0107119
##   2       28      1.327171  0.5343851  1.0107119
##   2       29      1.327171  0.5343851  1.0107119
##   2       30      1.327171  0.5343851  1.0107119
##   2       31      1.327171  0.5343851  1.0107119
##   2       32      1.327171  0.5343851  1.0107119
##   2       33      1.327171  0.5343851  1.0107119
##   2       34      1.327171  0.5343851  1.0107119
##   2       35      1.327171  0.5343851  1.0107119
##   2       36      1.327171  0.5343851  1.0107119
##   2       37      1.327171  0.5343851  1.0107119
##   2       38      1.327171  0.5343851  1.0107119
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.

# SVM Radial
set.seed(200)
svm_model <- train(
  x = x_train,
  y = y_train,
  method = "svmRadial",
  preProc = c("center", "scale"),
  tuneLength = 14,
  trControl = trainControl(method = "cv")
)
svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.363690  0.4910338  1.1102265
##      0.50  1.264912  0.5396158  1.0224011
##      1.00  1.169609  0.6060516  0.9313862
##      2.00  1.132682  0.6373329  0.8991576
##      4.00  1.129125  0.6401732  0.9010163
##      8.00  1.128698  0.6383315  0.8946774
##     16.00  1.123462  0.6414746  0.8898824
##     32.00  1.123462  0.6414746  0.8898824
##     64.00  1.123462  0.6414746  0.8898824
##    128.00  1.123462  0.6414746  0.8898824
##    256.00  1.123462  0.6414746  0.8898824
##    512.00  1.123462  0.6414746  0.8898824
##   1024.00  1.123462  0.6414746  0.8898824
##   2048.00  1.123462  0.6414746  0.8898824
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01364473
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01364473 and C = 16.

# KNN
set.seed(200)
knn_model <- train(
  x = x_train,
  y = y_train,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10,
  trControl = trainControl(method = "cv")
)
knn_model

## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    5  1.211973  0.5736136  0.9610462
##    7  1.251504  0.5430299  0.9936143
##    9  1.280785  0.5216002  1.0177954
##   11  1.304234  0.5067294  1.0444028
##   13  1.341687  0.4797393  1.0793004
##   15  1.368752  0.4557780  1.0925164
##   17  1.360050  0.4613685  1.0910232
##   19  1.372151  0.4589683  1.1076356
##   21  1.392773  0.4451492  1.1284405
##   23  1.384681  0.4581734  1.1157607
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

Which nonlinear regression model gives the optimal resampling and test set performance?

# Compare all models via resampling and test set performance
results <- resamples(list(
  NeuralNet = nnet_model,
  MARS      = mars_model,
  SVM       = svm_model,
  KNN       = knn_model
))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: NeuralNet, MARS, SVM, KNN 
## Number of resamples: 10 
## 
## MAE 
##                Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## NeuralNet 0.9020430 0.9305145 1.0824023 1.1611743 1.2390664 2.032501    0
## MARS      0.8272437 0.8876048 0.9199464 0.9872372 1.0003701 1.380789    0
## SVM       0.6441756 0.7438694 0.8932662 0.8898824 0.9980542 1.197383    0
## KNN       0.6154667 0.8431071 0.9252051 0.9610462 1.0807872 1.445846    0
## 
## RMSE 
##                Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## NeuralNet 1.0116140 1.1301716 1.331673 1.419732 1.503636 2.554810    0
## MARS      1.0359253 1.0810183 1.148108 1.221749 1.236913 1.838730    0
## SVM       0.8062012 0.8782177 1.067869 1.123462 1.268242 1.884572    0
## KNN       0.8854909 1.0221954 1.142363 1.211973 1.293895 1.988005    0
## 
## Rsquared 
##                 Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## NeuralNet 0.06229665 0.4146912 0.5461120 0.5061298 0.6234512 0.7291930    0
## MARS      0.37619797 0.4779541 0.5645000 0.5615016 0.6573148 0.6903550    0
## SVM       0.32326221 0.5300366 0.6847175 0.6414746 0.7831806 0.8035011    0
## KNN       0.27046464 0.4533846 0.6180339 0.5736136 0.6796181 0.7832414    0

postResample(predict(nnet_model, x_test), y_test)

##     RMSE Rsquared      MAE 
## 2.380885 0.205016 1.610862

postResample(predict(mars_model, x_test), y_test)

##      RMSE  Rsquared       MAE 
## 1.4045894 0.5224818 1.0821086

postResample(predict(svm_model,  x_test), y_test)

##      RMSE  Rsquared       MAE 
## 1.3289206 0.6243324 0.9940246

postResample(predict(knn_model,  x_test), y_test)

##      RMSE  Rsquared       MAE 
## 1.5958081 0.4470237 1.2768125

Based on both the cross-validation results and the test set performance, the SVM Radial model performed the best overall.

From the resampling results, SVM had the lowest mean RMSE (1.1235), the highest mean R-squared (0.6415), and the lowest mean MAE (0.8899) compared to Neural Networks, MARS, and KNN. This shows that SVM gave the best average performance during cross-validation, although its fold-to-fold RMSE (0.81 to 1.88) still varied considerably.
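Average performance and fold-to-fold spread are two different things, and both can be read off the RMSE block of summary(results). The sketch below copies the minimum, mean, and maximum RMSE for each model from that output; the column names are chosen here for illustration.

```r
# Min, mean, and max of the 10-fold CV RMSE, copied from summary(results)
cv_rmse <- data.frame(
  model     = c("NeuralNet", "MARS", "SVM", "KNN"),
  rmse_min  = c(1.0116140, 1.0359253, 0.8062012, 0.8854909),
  rmse_mean = c(1.419732, 1.221749, 1.123462, 1.211973),
  rmse_max  = c(2.554810, 1.838730, 1.884572, 1.988005),
  stringsAsFactors = FALSE
)

# Spread across folds: a rough indicator of how stable each model was
cv_rmse$rmse_range <- cv_rmse$rmse_max - cv_rmse$rmse_min

lowest_mean <- cv_rmse$model[which.min(cv_rmse$rmse_mean)]   # "SVM"
narrowest   <- cv_rmse$model[which.min(cv_rmse$rmse_range)]  # "MARS"
print(cv_rmse, row.names = FALSE)
```

SVM has the lowest mean RMSE, while MARS has the narrowest min-to-max range, so the two criteria do not point to the same model here.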

The test set results also confirmed that SVM was the strongest model. Its test RMSE was 1.3289 with an R-squared of 0.6243 and MAE of 0.9940, which was better than the other models:

Neural Network: RMSE = 2.3809
MARS: RMSE = 1.4046
KNN: RMSE = 1.5958

Even though MARS performed well, SVM had better overall test performance and handled the nonlinear relationships in the chemical manufacturing process data more effectively.
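It can also help to place the cross-validated and test set RMSE side by side. The sketch below uses only the values already printed above; the ratio column is an informal gauge added for illustration, not a standard diagnostic.

```r
# Mean CV RMSE and test set RMSE copied from the output above
perf <- data.frame(
  model     = c("NeuralNet", "MARS", "SVM", "KNN"),
  cv_rmse   = c(1.419732, 1.221749, 1.123462, 1.211973),
  test_rmse = c(2.380885, 1.4045894, 1.3289206, 1.5958081),
  stringsAsFactors = FALSE
)

# Ratio above 1 means the model did worse on the test set than in CV
perf$test_over_cv <- perf$test_rmse / perf$cv_rmse

best_test   <- perf$model[which.min(perf$test_rmse)]      # "SVM"
largest_gap <- perf$model[which.max(perf$test_over_cv)]   # "NeuralNet"
print(perf[order(perf$test_rmse), ], row.names = FALSE)
```

Every model has a higher RMSE on the test set than in cross-validation, which is not surprising with a small test set, and the neural network degrades the most.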

Overall, the SVM Radial model was the best nonlinear regression model for this dataset.

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

# Variable importance from best model
test_rmse <- c(
  NeuralNet = postResample(predict(nnet_model, x_test), y_test)[["RMSE"]],
  MARS      = postResample(predict(mars_model, x_test), y_test)[["RMSE"]],
  SVM       = postResample(predict(svm_model,  x_test), y_test)[["RMSE"]],
  KNN       = postResample(predict(knn_model,  x_test), y_test)[["RMSE"]]
)

best_name <- names(which.min(test_rmse))
best_name

## [1] "SVM"

model_list <- list(
  NeuralNet = nnet_model,
  MARS      = mars_model,
  SVM       = svm_model,
  KNN       = knn_model
)

best_model <- model_list[[best_name]]
varImp(best_model)

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     87.81
## ManufacturingProcess13   78.23
## BiologicalMaterial03     76.45
## BiologicalMaterial12     69.16
## ManufacturingProcess17   68.44
## ManufacturingProcess31   67.32
## ManufacturingProcess36   67.18
## ManufacturingProcess09   64.12
## ManufacturingProcess06   60.80
## BiologicalMaterial02     53.50
## ManufacturingProcess29   53.10
## BiologicalMaterial11     48.66
## ManufacturingProcess11   48.25
## ManufacturingProcess33   46.61
## ManufacturingProcess30   45.38
## BiologicalMaterial09     38.34
## BiologicalMaterial04     37.89
## BiologicalMaterial08     36.89
## ManufacturingProcess12   36.65

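The most important predictors for the best nonlinear model (SVM Radial) were a mix of manufacturing process and biological material variables. Note that the scores are not specific to the SVM: as the "loess r-squared" header in the output indicates, varImp() reports a model-independent filter score computed for each predictor against yield.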

The top predictors were:

ManufacturingProcess32
BiologicalMaterial06
ManufacturingProcess13
BiologicalMaterial03
BiologicalMaterial12
ManufacturingProcess17
ManufacturingProcess31
ManufacturingProcess36
ManufacturingProcess09
ManufacturingProcess06

From these results, manufacturing process variables dominate the list: seven of the top ten predictors come from the ManufacturingProcess group and three from the BiologicalMaterial group. The three biological variables (BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial12) all rank in the top five, however, showing that both groups contribute to predicting yield.
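The split between the two groups can be counted directly from the predictor names. The sketch below hard-codes the top ten names from the importance output above and classifies them by prefix.

```r
# Top ten predictors, copied in rank order from the varImp() output
top10 <- c(
  "ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess13",
  "BiologicalMaterial03", "BiologicalMaterial12", "ManufacturingProcess17",
  "ManufacturingProcess31", "ManufacturingProcess36", "ManufacturingProcess09",
  "ManufacturingProcess06"
)

# Classify each predictor by its name prefix
group <- ifelse(grepl("^ManufacturingProcess", top10), "Process", "Biological")
group_counts <- table(group)
group_counts
# Biological: 3, Process: 7
```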

Compared to the optimal linear model from Exercise 6.3, many of the same predictors still appear important, especially ManufacturingProcess variables like 32, 17, and 13. This suggests that these variables have strong influence on yield regardless of whether a linear or nonlinear model is used.

Overall, the process variables were slightly more dominant, but both biological and process predictors were important in explaining the final product yield.
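The "loess r-squared" label in the importance output refers to a filter score: each predictor is smoothed against the response on its own, and the fit is compared with an intercept-only model. The sketch below illustrates that idea on simulated data. It is a rough approximation, not caret's exact code; loess_r2 and the simulated variables are hypothetical.

```r
# Hypothetical helper: pseudo R-squared of a loess fit versus an intercept-only model
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(1)
n <- 150
informative <- runif(n)   # drives the response through a curved relationship
noise       <- runif(n)   # unrelated to the response
y <- 20 * (informative - 0.5)^2 + rnorm(n, sd = 0.5)

scores <- c(
  informative = loess_r2(informative, y),
  noise       = loess_r2(noise, y)
)
round(scores, 3)
```

Because each predictor is scored on its own, this kind of ranking ignores interactions and says nothing about how the fitted SVM actually uses the predictors.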

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# top 10 predictors vs yield
imp <- varImp(best_model)$importance
imp$Predictor <- rownames(imp)
imp$Overall   <- imp[, 1]
imp <- imp[order(-imp$Overall), ]
top_names <- imp$Predictor[1:10]

par(mfrow = c(2, 5))
for (v in top_names) {
  plot(
    x_train[, v],
    y_train,
    xlab = v,
    ylab = "Yield",
    main = v
  )
  abline(lm(y_train ~ x_train[, v]))
}

The plots show that several of the top predictors have clear relationships with yield, especially ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess09.

Some predictors show a positive relationship with yield, such as ManufacturingProcess32, BiologicalMaterial06, and ManufacturingProcess09, where higher values are generally associated with higher yield. Other predictors, such as ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess31, show a negative relationship, where larger values are associated with lower yield.
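The direction of each relationship can also be checked numerically rather than read off the fitted lines. The sketch below uses simulated stand-in data, since the point is only the pattern of the calculation; the column names rises and falls and the yield formula are made up.

```r
set.seed(2)
n <- 100

# Simulated stand-ins for two predictors (hypothetical, not the real data)
predictors <- data.frame(
  rises = rnorm(n),
  falls = rnorm(n)
)
yield <- 40 + 1.5 * predictors$rises - 1.5 * predictors$falls + rnorm(n, sd = 0.5)

# Pearson correlation of each predictor with the response
r <- sapply(predictors, function(col) cor(col, yield))
direction <- ifelse(r > 0, "positive", "negative")

data.frame(correlation = round(r, 2), direction = direction)
```

With the real data, the same sapply() call could be applied to x_train[, top_names] and y_train. A correlation near zero would not rule out a curved relationship, so the plots remain useful.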

A few predictors, like ManufacturingProcess36, show weaker or less consistent patterns, which suggests that their effect may be more complex or may depend on interactions with other variables rather than a simple straight-line relationship.

These plots suggest that the manufacturing process variables are more strongly associated with yield than the biological material variables, although scatter plots alone cannot establish a causal effect. This makes sense because process settings often directly control production efficiency and final output quality. Biological variables still matter, but their relationships with yield appear weaker than those of the process variables.

Overall, the plots support the idea that both biological and process predictors affect yield, but manufacturing process variables seem to have the strongest influence in the nonlinear model.

Data624_Homework8_ahmed_hassan

Ahmed Hassan

2026-04-26
