DATA 624 Homework 8

Author

Henock Montcho

Published

May 16, 2025

1-) Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(caret)
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision: 
testData <- mlbench.friedman1(5000, sd = 1) 
testData$x <- data.frame(testData$x)
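The simulation equation can also be written out by hand, which makes the split between the five informative predictors and the five noise predictors explicit. The sketch below is a base-R approximation only: the helper name simulate_friedman1 is hypothetical, it is not the mlbench implementation, and its random draws will not reproduce the mlbench.friedman1 values for the same seed.

```r
# Hypothetical helper: simulate Friedman's benchmark directly from the formula.
# Not the mlbench code; draws will differ from mlbench.friedman1().
simulate_friedman1 <- function(n, sd = 1) {
  # 10 uniform predictors on [0, 1]; only X1..X5 enter the response
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # additive Gaussian noise N(0, sd^2)
  y <- signal + rnorm(n, mean = 0, sd = sd)
  list(x = data.frame(x), y = y)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)      # 200 rows, 10 columns
length(sim$y)   # 200
```

Columns X6 to X10 are generated but never used in the response, which is why a good model should ignore them.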

Tune several models on these data. For example:

library(caret)
knnModel <-train(x = trainingData$x,
                 y = trainingData$y, 
                 method = "knn",
                 preProc = c("center", "scale"), 
                 tuneLength = 10)
knnModel
k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
knnPred <-predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461 
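For reference, the three statistics reported above can be written out in base R. This is a minimal sketch under the assumption that R² is the squared Pearson correlation between observed and predicted values; the helper name regression_metrics and the toy vectors are made up for illustration and are not taken from the homework data.

```r
# Sketch of the three summary statistics in base R.
# Assumption: Rsquared is the squared correlation of observed and predicted values.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(obs, pred)^2,
    MAE      = mean(abs(obs - pred)))
}

# Toy values (hypothetical, not from the homework data)
obs  <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 6)
m <- regression_metrics(pred, obs)
m
```

Lower RMSE and MAE and higher R² indicate better predictions, which is the basis for the model comparisons in this homework.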
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <- findCorrelation(cor(trainingData$x), cutoff = .75)
# returns an empty variable

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))


# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)

# tune
nnetTune <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
Neural Network 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  decay  size  RMSE      Rsquared   MAE     
  0.00    1    2.540546  0.7254252  2.008197
  0.00    2    2.655140  0.7062546  2.133980
  0.00    3    2.308948  0.7717065  1.813897
  0.00    4    2.268677  0.8065299  1.791016
  0.00    5    2.491449  0.7556790  1.938877
  0.00    6    3.445760  0.6172989  2.291658
  0.00    7    5.259894  0.5137135  3.140884
  0.00    8    5.096729  0.4494295  3.299397
  0.00    9    6.724966  0.5040399  4.091065
  0.00   10    3.529274  0.6008843  2.765820
  0.01    1    2.385136  0.7603460  1.887587
  0.01    2    2.583767  0.7260485  2.018814
  0.01    3    2.282621  0.7815267  1.812073
  0.01    4    2.274770  0.7901402  1.842674
  0.01    5    2.653199  0.7237241  2.117160
  0.01    6    2.753791  0.7153836  2.190768
  0.01    7    2.799525  0.7123252  2.209083
  0.01    8    3.342579  0.6050358  2.630422
  0.01    9    3.795128  0.6115537  2.873952
  0.01   10    3.453153  0.6008848  2.824870
  0.10    1    2.394058  0.7596252  1.894319
  0.10    2    2.618767  0.7152952  2.117662
  0.10    3    2.580239  0.7353788  2.005527
  0.10    4    2.442308  0.7777448  1.907659
  0.10    5    2.617403  0.7227439  2.059628
  0.10    6    2.543814  0.7549811  2.060699
  0.10    7    2.811804  0.6959408  2.149668
  0.10    8    2.900555  0.6937553  2.336332
  0.10    9    3.101142  0.6579730  2.481152
  0.10   10    2.973902  0.6608165  2.360836

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 4 and decay = 0.
nnPred <- predict(nnetTune, testData$x)

postResample(nnPred, testData$y)
     RMSE  Rsquared       MAE 
2.4700280 0.7762282 1.9078363 
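The MaxNWts value passed to train above is the weight count of the largest network in the tuning grid. A short worked check, assuming a single hidden layer, one output unit, and no skip-layer connections (the helper name n_weights is hypothetical):

```r
# Weights in a single-hidden-layer network with one output:
#   size * (p + 1)  weights into the hidden units (including biases)
#   size + 1        weights into the output unit (including its bias)
n_weights <- function(p, size) size * (p + 1) + size + 1

n_weights(p = 10, size = 10)   # 121, the MaxNWts value for the largest grid entry
n_weights(p = 10, size = 4)    # 49, the size of the selected model
```

Setting MaxNWts to the count for size = 10 lets every network in the grid be fit without exceeding the limit.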
library(earth)
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

# tune
marsTune <- train(trainingData$x, trainingData$y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTune
Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE     
  1        2      4.327937  0.2544880  3.600474
  1        3      3.572450  0.4912720  2.895811
  1        4      2.596841  0.7183600  2.106341
  1        5      2.370161  0.7659777  1.918669
  1        6      2.276141  0.7881481  1.810001
  1        7      1.766728  0.8751831  1.390215
  1        8      1.780946  0.8723243  1.401345
  1        9      1.665091  0.8819775  1.325515
  1       10      1.663804  0.8821283  1.327657
  1       11      1.657738  0.8822967  1.331730
  1       12      1.653784  0.8827903  1.331504
  1       13      1.648496  0.8823663  1.316407
  1       14      1.639073  0.8841742  1.312833
  1       15      1.639073  0.8841742  1.312833
  1       16      1.639073  0.8841742  1.312833
  1       17      1.639073  0.8841742  1.312833
  1       18      1.639073  0.8841742  1.312833
  1       19      1.639073  0.8841742  1.312833
  1       20      1.639073  0.8841742  1.312833
  1       21      1.639073  0.8841742  1.312833
  1       22      1.639073  0.8841742  1.312833
  1       23      1.639073  0.8841742  1.312833
  1       24      1.639073  0.8841742  1.312833
  1       25      1.639073  0.8841742  1.312833
  1       26      1.639073  0.8841742  1.312833
  1       27      1.639073  0.8841742  1.312833
  1       28      1.639073  0.8841742  1.312833
  1       29      1.639073  0.8841742  1.312833
  1       30      1.639073  0.8841742  1.312833
  1       31      1.639073  0.8841742  1.312833
  1       32      1.639073  0.8841742  1.312833
  1       33      1.639073  0.8841742  1.312833
  1       34      1.639073  0.8841742  1.312833
  1       35      1.639073  0.8841742  1.312833
  1       36      1.639073  0.8841742  1.312833
  1       37      1.639073  0.8841742  1.312833
  1       38      1.639073  0.8841742  1.312833
  2        2      4.327937  0.2544880  3.600474
  2        3      3.572450  0.4912720  2.895811
  2        4      2.661826  0.7070510  2.173471
  2        5      2.404015  0.7578971  1.975387
  2        6      2.243927  0.7914805  1.783072
  2        7      1.856336  0.8605482  1.435682
  2        8      1.754607  0.8763186  1.396841
  2        9      1.603578  0.8938666  1.261361
  2       10      1.492421  0.9084998  1.168700
  2       11      1.317350  0.9292504  1.033926
  2       12      1.304327  0.9320133  1.019108
  2       13      1.277510  0.9323681  1.002927
  2       14      1.269626  0.9350024  1.003346
  2       15      1.266217  0.9359400  1.013893
  2       16      1.268470  0.9354868  1.011414
  2       17      1.268470  0.9354868  1.011414
  2       18      1.268470  0.9354868  1.011414
  2       19      1.268470  0.9354868  1.011414
  2       20      1.268470  0.9354868  1.011414
  2       21      1.268470  0.9354868  1.011414
  2       22      1.268470  0.9354868  1.011414
  2       23      1.268470  0.9354868  1.011414
  2       24      1.268470  0.9354868  1.011414
  2       25      1.268470  0.9354868  1.011414
  2       26      1.268470  0.9354868  1.011414
  2       27      1.268470  0.9354868  1.011414
  2       28      1.268470  0.9354868  1.011414
  2       29      1.268470  0.9354868  1.011414
  2       30      1.268470  0.9354868  1.011414
  2       31      1.268470  0.9354868  1.011414
  2       32      1.268470  0.9354868  1.011414
  2       33      1.268470  0.9354868  1.011414
  2       34      1.268470  0.9354868  1.011414
  2       35      1.268470  0.9354868  1.011414
  2       36      1.268470  0.9354868  1.011414
  2       37      1.268470  0.9354868  1.011414
  2       38      1.268470  0.9354868  1.011414

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 15 and degree = 2.
marsPred <- predict(marsTune, testData$x)

postResample(marsPred, testData$y)
     RMSE  Rsquared       MAE 
1.1589948 0.9460418 0.9250230 
library(kernlab)
set.seed(101)

# tune
svmRTune <- train(trainingData$x, trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRTune
Support Vector Machines with Radial Basis Function Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  C        RMSE      Rsquared   MAE     
     0.25  2.518451  0.7977688  2.010337
     0.50  2.271316  0.8116556  1.804299
     1.00  2.106614  0.8331518  1.662671
     2.00  2.019537  0.8441622  1.570576
     4.00  1.939589  0.8559148  1.516878
     8.00  1.904125  0.8612118  1.497448
    16.00  1.900928  0.8620090  1.502851
    32.00  1.900928  0.8620090  1.502851
    64.00  1.900928  0.8620090  1.502851
   128.00  1.900928  0.8620090  1.502851
   256.00  1.900928  0.8620090  1.502851
   512.00  1.900928  0.8620090  1.502851
  1024.00  1.900928  0.8620090  1.502851
  2048.00  1.900928  0.8620090  1.502851

Tuning parameter 'sigma' was held constant at a value of 0.06172165
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06172165 and C = 16.
svmRPred <- predict(svmRTune, testData$x)

postResample(svmRPred, testData$y)
     RMSE  Rsquared       MAE 
2.0708161 0.8261186 1.5729691 

Which models appear to give the best performance?

Comment: MARS gives the best test-set performance, with the lowest RMSE (1.16) and MAE (0.93) and the highest R² (0.95). The radial SVM ranks second (RMSE 2.07, R² 0.83), followed by the neural network (RMSE 2.47) and KNN (RMSE 3.20).
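The ranking can be checked by collecting the four postResample results into one table and sorting by RMSE. The sketch below copies the test-set values reported above; the object names test_results and ranked are only illustrative.

```r
# Test-set results copied from the postResample() output above
test_results <- data.frame(
  model    = c("KNN", "Neural network", "MARS", "SVM (radial)"),
  RMSE     = c(3.2040595, 2.4700280, 1.1589948, 2.0708161),
  Rsquared = c(0.6819919, 0.7762282, 0.9460418, 0.8261186),
  MAE      = c(2.5683461, 1.9078363, 0.9250230, 1.5729691)
)

# Sort from best (lowest RMSE) to worst
ranked <- test_results[order(test_results$RMSE), ]
ranked
```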

Does MARS select the informative predictors (those named X1–X5)?

varImp(marsTune)
earth variable importance

   Overall
X1  100.00
X4   75.24
X2   48.73
X5   15.52
X3    0.00
plot(varImp(marsTune))

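Comment: Yes. The importance list contains only X1–X5, so MARS picks up the informative predictors and none of the five non-informative ones (X6–X10). X3 is still listed, but with a scaled importance of zero, the lowest of the five.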
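A scaled importance of zero does not have to mean a raw importance of zero. The sketch below assumes the scores are rescaled with a min-max transform to the 0–100 range, in which case the lowest-ranked listed predictor always lands at exactly zero. Both the rescale_0_100 helper and the raw scores are hypothetical and are not values from the fitted model.

```r
# Assumption: importance scores are min-max rescaled to 0-100.
rescale_0_100 <- function(x) (x - min(x)) / (max(x) - min(x)) * 100

# Hypothetical raw scores for five selected predictors (made up for illustration)
raw <- c(X1 = 80, X4 = 62, X2 = 43, X5 = 20, X3 = 9)
scaled <- round(rescale_0_100(raw), 2)
scaled
```

Under this assumption the smallest raw score maps to 0 and the largest to 100, regardless of how large the smallest raw score is.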

2-) Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# imputation
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)

# remove near-zero-variance predictors
Chemical <- Chemical[, -nearZeroVar(Chemical)]

set.seed(1122)

# index for training
index <- createDataPartition(Chemical$Yield, p = .8, list = FALSE)

# train 
train_x <- Chemical[index, -1]
train_y <- Chemical[index, 1]

# test
test_x <- Chemical[-index, -1]
test_y <- Chemical[-index, 1]
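The split above samples within groups of the response so that training and test sets cover a similar range of Yield. A rough base-R sketch of that idea follows. It only approximates createDataPartition and will not return the same indices; the helper name stratified_split and the simulated y_demo vector (a stand-in for Yield) are hypothetical.

```r
# Rough sketch of a quantile-stratified split in base R.
# Assumption: approximates, but does not reproduce, caret::createDataPartition().
stratified_split <- function(y, p = 0.8, groups = 5) {
  # bin the response by its quantiles
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # sample a fraction p of the row indices inside each bin
  idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  }))
  sort(unname(idx))
}

set.seed(1122)
y_demo <- rnorm(176, mean = 40, sd = 2)   # hypothetical stand-in for Yield
train_idx <- stratified_split(y_demo)
frac <- length(train_idx) / length(y_demo)
frac   # close to 0.8
```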

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

knnModel <- train(train_x, train_y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
k-Nearest Neighbors 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 144, 144, 144, 144, 144, 144, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  1.396642  0.3744077  1.114509
   7  1.376192  0.3917046  1.112997
   9  1.374055  0.3930803  1.124211
  11  1.368435  0.3997261  1.124392
  13  1.374638  0.3988759  1.134410
  15  1.379299  0.3990339  1.139894
  17  1.370991  0.4118993  1.134783
  19  1.373526  0.4128174  1.138457
  21  1.387453  0.4049173  1.150880
  23  1.393734  0.4067580  1.157569

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 11.
knnPred <- predict(knnModel, test_x)
postResample(pred = knnPred, obs = test_y)
     RMSE  Rsquared       MAE 
1.5908207 0.5862244 1.2103693 
# remove predictors to ensure maximum abs pairwise corr between predictors < 0.75
tooHigh <- findCorrelation(cor(train_x), cutoff = .75)

# removing 19 highly correlated predictors (56 -> 37)
train_x_nnet <- train_x[, -tooHigh]
test_x_nnet <- test_x[, -tooHigh]

# create a tuning grid
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = c(1:10))

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

set.seed(100)

# tune
nnetTune <- train(train_x_nnet, train_y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_x_nnet) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
Neural Network 

144 samples
 37 predictor

Pre-processing: centered (37), scaled (37) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
Resampling results across tuning parameters:

  decay  size  RMSE      Rsquared    MAE     
  0.00    1    1.605147  0.22838387  1.326253
  0.00    2    1.542326  0.28503898  1.226476
  0.00    3    3.584369  0.25820694  2.618407
  0.00    4    3.722985  0.12717373  2.939418
  0.00    5    3.260115  0.18358386  2.601427
  0.00    6    3.467400  0.16913974  2.711412
  0.00    7    4.617206  0.10624310  3.422947
  0.00    8    6.054699  0.09265855  4.346180
  0.00    9    6.640577  0.18065879  4.334515
  0.00   10    7.142983  0.13527184  4.848827
  0.01    1    1.571353  0.38131816  1.292421
  0.01    2    2.174145  0.37909307  1.544274
  0.01    3    2.862542  0.24669420  2.208833
  0.01    4    3.026441  0.19521724  2.313311
  0.01    5    2.856393  0.17241600  2.190576
  0.01    6    2.653319  0.21673711  2.079070
  0.01    7    2.516029  0.22144524  2.020586
  0.01    8    2.536187  0.19610389  1.926985
  0.01    9    2.782617  0.14596066  2.248048
  0.01   10    2.857733  0.18869061  2.167975
  0.10    1    1.565384  0.35990675  1.256153
  0.10    2    1.802949  0.32563323  1.386549
  0.10    3    2.522964  0.24684902  2.000231
  0.10    4    2.945406  0.12890048  2.242661
  0.10    5    2.652436  0.10846934  2.081629
  0.10    6    2.771432  0.18231080  2.040593
  0.10    7    2.419471  0.22012438  1.935937
  0.10    8    2.674768  0.22893183  1.939827
  0.10    9    2.364701  0.27528189  1.853892
  0.10   10    2.343169  0.20853359  1.781560

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 2 and decay = 0.
nnPred <- predict(nnetTune, test_x_nnet)

postResample(nnPred, test_y)
     RMSE  Rsquared       MAE 
1.6352329 0.4515722 1.1869338 
# create a tuning grid
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100009)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsTune
Multivariate Adaptive Regression Spline 

144 samples
 56 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 130, 128, 130, 131, 130, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE      Rsquared   MAE      
  1        2      1.341438  0.4446601  1.0737081
  1        3      1.207259  0.5532391  0.9735850
  1        4      1.219471  0.5593102  0.9776334
  1        5      1.244175  0.5411752  0.9843855
  1        6      1.263590  0.5226554  1.0245809
  1        7      1.219323  0.5511849  0.9743043
  1        8      1.227823  0.5497759  1.0030936
  1        9      1.270475  0.5329809  1.0278800
  1       10      1.278102  0.5356383  1.0467257
  1       11      1.266671  0.5462880  1.0455230
  1       12      1.296903  0.5277882  1.0687290
  1       13      1.284006  0.5466799  1.0597154
  1       14      1.309123  0.5367861  1.0836185
  1       15      1.322817  0.5270802  1.0962354
  1       16      1.315613  0.5396160  1.0826213
  1       17      1.315816  0.5400492  1.0851170
  1       18      1.315816  0.5400492  1.0851170
  1       19      1.315816  0.5400492  1.0851170
  1       20      1.315816  0.5400492  1.0851170
  1       21      1.315816  0.5400492  1.0851170
  1       22      1.315816  0.5400492  1.0851170
  1       23      1.315816  0.5400492  1.0851170
  1       24      1.315816  0.5400492  1.0851170
  1       25      1.315816  0.5400492  1.0851170
  1       26      1.315816  0.5400492  1.0851170
  1       27      1.315816  0.5400492  1.0851170
  1       28      1.315816  0.5400492  1.0851170
  1       29      1.315816  0.5400492  1.0851170
  1       30      1.315816  0.5400492  1.0851170
  1       31      1.315816  0.5400492  1.0851170
  1       32      1.315816  0.5400492  1.0851170
  1       33      1.315816  0.5400492  1.0851170
  1       34      1.315816  0.5400492  1.0851170
  1       35      1.315816  0.5400492  1.0851170
  1       36      1.315816  0.5400492  1.0851170
  1       37      1.315816  0.5400492  1.0851170
  1       38      1.315816  0.5400492  1.0851170
  2        2      1.341438  0.4446601  1.0737081
  2        3      1.261439  0.5189796  1.0144065
  2        4      1.317336  0.5125364  1.0199475
  2        5      1.353076  0.4927882  1.0370695
  2        6      1.379426  0.4822155  1.0737506
  2        7      1.391965  0.4713774  1.0686494
  2        8      1.434591  0.4888686  1.0718269
  2        9      1.392080  0.5044489  1.0270856
  2       10      1.407816  0.5065727  1.0253324
  2       11      1.492846  0.4561781  1.0815316
  2       12      1.471280  0.4754442  1.0473993
  2       13      1.456659  0.4802637  1.0536210
  2       14      1.464228  0.4618070  1.0850904
  2       15      1.476440  0.4554687  1.0936283
  2       16      1.510031  0.4511219  1.1172838
  2       17      1.489304  0.4667694  1.0906567
  2       18      1.470880  0.4773174  1.0815107
  2       19      1.472741  0.4776061  1.0766876
  2       20      1.521745  0.4760915  1.0966481
  2       21      1.518777  0.4762490  1.0956059
  2       22      1.499180  0.4782054  1.0834936
  2       23      1.501226  0.4760950  1.0885354
  2       24      1.501226  0.4760950  1.0885354
  2       25      1.501226  0.4760950  1.0885354
  2       26      1.501226  0.4760950  1.0885354
  2       27      1.501226  0.4760950  1.0885354
  2       28      1.501226  0.4760950  1.0885354
  2       29      1.501226  0.4760950  1.0885354
  2       30      1.501226  0.4760950  1.0885354
  2       31      1.501226  0.4760950  1.0885354
  2       32      1.501226  0.4760950  1.0885354
  2       33      1.501226  0.4760950  1.0885354
  2       34      1.501226  0.4760950  1.0885354
  2       35      1.501226  0.4760950  1.0885354
  2       36      1.501226  0.4760950  1.0885354
  2       37      1.501226  0.4760950  1.0885354
  2       38      1.501226  0.4760950  1.0885354

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 3 and degree = 1.
marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
    RMSE Rsquared      MAE 
1.419469 0.612023 1.137812 
set.seed(0909)

# tune
svmRTune <- train(train_x, train_y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = trainControl(method = "cv"))

svmRTune
Support Vector Machines with Radial Basis Function Kernel 

144 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 130, 130, 128, 131, 130, 130, ... 
Resampling results across tuning parameters:

  C        RMSE      Rsquared   MAE      
     0.25  1.334962  0.5230124  1.1059145
     0.50  1.235421  0.5541532  1.0318823
     1.00  1.177799  0.5889356  0.9935824
     2.00  1.114080  0.6334924  0.9355337
     4.00  1.113288  0.6336954  0.9350635
     8.00  1.146250  0.6038205  0.9595706
    16.00  1.157939  0.5957328  0.9707883
    32.00  1.157939  0.5957328  0.9707883
    64.00  1.157939  0.5957328  0.9707883
   128.00  1.157939  0.5957328  0.9707883
   256.00  1.157939  0.5957328  0.9707883
   512.00  1.157939  0.5957328  0.9707883
  1024.00  1.157939  0.5957328  0.9707883
  2048.00  1.157939  0.5957328  0.9707883

Tuning parameter 'sigma' was held constant at a value of 0.0131875
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.0131875 and C = 4.
svmRPred <- predict(svmRTune, test_x)

postResample(svmRPred, test_y)
     RMSE  Rsquared       MAE 
1.2836304 0.7007812 0.9736821 

Grouping the performance test results

rbind(knn = postResample(knnPred, test_y),
      nn = postResample(nnPred, test_y),
      mars = postResample(marsPred, test_y),
      svmR = postResample(svmRPred, test_y))
         RMSE  Rsquared       MAE
knn  1.590821 0.5862244 1.2103693
nn   1.635233 0.4515722 1.1869338
mars 1.419469 0.6120230 1.1378123
svmR 1.283630 0.7007812 0.9736821

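Comment: The radial SVM gives the best performance on both counts. It has the lowest resampled RMSE (1.113, against 1.207 for MARS, 1.368 for KNN, and 1.542 for the neural network) and, on the test set, the lowest RMSE (1.284) and MAE (0.974) along with the highest R² (0.701).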
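To look at resampling and test-set performance side by side, the best resampled RMSE of each model can be placed next to its test-set RMSE. The values below are copied from the output above; note that KNN was resampled with the bootstrap while the other three models used 10-fold cross-validation, so the resampled figures are not strictly comparable. The object name perf is only illustrative.

```r
# Best resampled RMSE per model vs. test-set RMSE (values copied from the output above).
# KNN used bootstrap resampling; the other models used 10-fold CV.
perf <- data.frame(
  model         = c("knn", "nn", "mars", "svmR"),
  resample_RMSE = c(1.368435, 1.542326, 1.207259, 1.113288),
  test_RMSE     = c(1.590821, 1.635233, 1.419469, 1.283630)
)

# Positive gap: the test-set error is larger than the resampled estimate
perf$gap <- perf$test_RMSE - perf$resample_RMSE
perf[order(perf$test_RMSE), ]
```

Every model does somewhat worse on the held-out data than in resampling, and the ordering of the models by RMSE is the same in both columns.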

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list?

varImp(svmRTune)
loess r-squared variable importance

  only 20 most important variables shown (out of 56)

                       Overall
ManufacturingProcess32  100.00
BiologicalMaterial06     80.51
ManufacturingProcess36   72.83
ManufacturingProcess13   71.63
BiologicalMaterial03     67.20
ManufacturingProcess09   64.46
BiologicalMaterial12     60.68
ManufacturingProcess31   57.48
BiologicalMaterial02     57.30
ManufacturingProcess17   56.77
BiologicalMaterial11     49.00
BiologicalMaterial04     48.73
ManufacturingProcess33   47.31
ManufacturingProcess06   47.00
ManufacturingProcess29   39.67
ManufacturingProcess11   38.13
BiologicalMaterial08     33.62
BiologicalMaterial01     31.78
BiologicalMaterial09     30.20
ManufacturingProcess30   29.21
plot(varImp(svmRTune), top = 20)

Comment: Process variables make up a slight majority of the top 20, 11 to 9, mirroring the distribution observed in the optimal linear model from Homework 7. The single most important predictor, ManufacturingProcess32, is also a process variable.
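The 11-to-9 count can be verified by stripping the trailing digits from the predictor names and tabulating the prefixes. The names below are copied, in order, from the varImp output above; the object names are only illustrative.

```r
# Top 20 predictors, in order, copied from the varImp(svmRTune) output above
top20 <- c(
  "ManufacturingProcess32", "BiologicalMaterial06", "ManufacturingProcess36",
  "ManufacturingProcess13", "BiologicalMaterial03", "ManufacturingProcess09",
  "BiologicalMaterial12", "ManufacturingProcess31", "BiologicalMaterial02",
  "ManufacturingProcess17", "BiologicalMaterial11", "BiologicalMaterial04",
  "ManufacturingProcess33", "ManufacturingProcess06", "ManufacturingProcess29",
  "ManufacturingProcess11", "BiologicalMaterial08", "BiologicalMaterial01",
  "BiologicalMaterial09", "ManufacturingProcess30"
)

# Drop the trailing number to get the predictor type, then count
type <- sub("[0-9]+$", "", top20)
counts_top20 <- table(type)
counts_top10 <- table(type[1:10])
counts_top20   # BiologicalMaterial 9, ManufacturingProcess 11
counts_top10   # BiologicalMaterial 4, ManufacturingProcess 6
```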

- How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
missing <- preProcess(ChemicalManufacturingProcess, method = "bagImpute")
Chemical <- predict(missing, ChemicalManufacturingProcess)

set.seed(9987)

larsTune <- train(Yield ~ ., Chemical , method = "lars", metric = "Rsquared",
                   tuneLength = 20, trControl = ctrl, preProc = c("center", "scale"))

plot(varImp(larsTune), top = 10,
     main = "Linear: Top 10 Important Predictors")

plot(varImp(svmRTune), top = 10,
     main = "Nonlinear: Top 10 Important Predictors")

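Comment: The top ten most important predictors match those identified in the optimal linear model (the LARS model), with the exception of one predictor that does not appear in both lists. Part of this agreement may come from how the importance is computed: as the output header indicates, the SVM scores are loess R² values calculated from each predictor and the response, not from the fitted SVM itself.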

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

library(corrplot)
library(dplyr)
top10 <- varImp(svmRTune)$importance  |>
  arrange(-Overall)  |>
  head(10)


Chemical  |>
  select(c("Yield", row.names(top10)))  |>
  cor()  |>
  corrplot()

Comment: In the correlation plot, ManufacturingProcess32 has the strongest positive correlation with Yield. Three of the top ten variables are negatively correlated with Yield.