Kuhn and Johnson, Chapter 7

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. The package mlbench contains a function called mlbench.friedman1 that simulates these data.

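# mlbench provides mlbench.friedman1; caret provides featurePlot, train, and postResample
library(mlbench)
library(caret)

set.seed(200)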
trainingData <- mlbench.friedman1(200, sd = 1)
# We convert the 'x' data from a matrix to a data frame
# One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)

# Look at the data using
featurePlot(trainingData$x, trainingData$y)

# or other methods.

# This creates a list with a vector 'y' and a matrix
# of predictors 'x'. Also simulate a large test set to
# estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
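
To make the simulation concrete, here is a minimal base-R sketch of the Friedman 1 benchmark as we understand it: ten independent uniform predictors, of which only the first five enter the response. The helper name simulate_friedman1 is our own and is not part of mlbench, and the draws are not guaranteed to match mlbench.friedman1 for the same seed.

```r
# Base-R sketch of the Friedman 1 benchmark (no mlbench required).
# simulate_friedman1 is a hypothetical helper name, not part of any package.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten independent predictors, each uniform on [0, 1]
  x <- matrix(runif(n * 10), ncol = 10)
  # Only the first five columns enter the response
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with the requested standard deviation
  y <- signal + rnorm(n, sd = sd)
  list(x = x, y = y, signal = signal)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)      # rows and columns of the predictor matrix
length(sim$y)   # number of simulated responses
```

Columns 6 to 10 never appear in the formula, which is why a good model should ignore them.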

Tune several models on these data.

We will fit several nonlinear models to the data.

KNN Model

knnModel <- train(x = trainingData$x, 
                  y = trainingData$y, 
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
# The function 'postResample' can be used to get the test set
# performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
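
For reference, the three values reported by postResample can be reproduced by hand. The sketch below is our own base-R version, assuming (as we understand caret's behavior) that Rsquared is the squared correlation between predictions and observations. The helper name regression_metrics and the toy vectors are made up for illustration.

```r
# Base-R sketch of the three metrics reported by postResample.
# regression_metrics is a hypothetical helper name used for illustration.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

# Toy values, not taken from the models above
obs  <- c(10, 12, 15, 11, 18)
pred <- c(11, 12, 14, 10, 17)

m <- regression_metrics(pred, obs)
m
```

Because Rsquared is a squared correlation here, it measures how well predictions track the observations, not how close they are in absolute terms. RMSE and MAE capture the latter.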

Neural Network Model

First, we check for highly correlated predictors, since these can adversely affect a neural network model. Let’s see whether any pair of predictors has a correlation above 0.7.

findCorrelation(cor(trainingData$x), cutoff = 0.7)
## integer(0)

There are no highly correlated variables. We can proceed with a neural network model.
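
An empty result from findCorrelation means no pair of predictors exceeds the cutoff. The same check can be sketched in base R by looking at the largest absolute off-diagonal correlation. The data below are simulated stand-ins (independent uniforms, like the Friedman predictors) so that the block runs on its own. This is a simple check, not the procedure findCorrelation uses to decide which column to drop.

```r
# Sketch: largest absolute pairwise correlation among predictors.
# The data are simulated here so the block runs without mlbench.
set.seed(200)
x <- as.data.frame(matrix(runif(200 * 10), ncol = 10))

cor_mat <- abs(cor(x))
# Keep each pair once and ignore the diagonal of 1s
pairwise <- cor_mat[upper.tri(cor_mat)]
max_cor <- max(pairwise)

max_cor               # largest |r| over the 45 predictor pairs
any(pairwise > 0.7)   # TRUE would mean at least one pair exceeds the cutoff
```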

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), 
                        .size = c(1:10))

set.seed(613)
nnetTune <- train(trainingData$x,
                  trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"),
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500)

nnetTune
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.403507  0.7743189  1.901555
##   0.00    2    2.566132  0.7370821  2.138302
##   0.00    3    2.200392  0.8013902  1.711597
##   0.00    4    2.284752  0.7974063  1.797241
##   0.00    5    3.054918  0.7091331  2.157627
##   0.00    6    3.616891  0.6452446  2.377875
##   0.00    7    6.828992  0.5669780  4.131037
##   0.00    8    7.215473  0.4390305  4.255543
##   0.00    9    7.868435  0.5330447  4.460909
##   0.00   10    4.687039  0.5024569  3.358780
##   0.01    1    2.399650  0.7747890  1.895779
##   0.01    2    2.672432  0.7215392  2.143019
##   0.01    3    2.336309  0.7830089  1.925250
##   0.01    4    2.407845  0.7723048  1.953429
##   0.01    5    2.478010  0.7735236  1.965944
##   0.01    6    2.819053  0.7098835  2.283037
##   0.01    7    3.184249  0.6535647  2.503848
##   0.01    8    3.329401  0.6476834  2.608900
##   0.01    9    3.424461  0.5984221  2.655507
##   0.01   10    3.549240  0.6102028  2.713519
##   0.10    1    2.409116  0.7733430  1.899935
##   0.10    2    2.641610  0.7293005  2.085931
##   0.10    3    2.549721  0.7415061  2.070464
##   0.10    4    2.437229  0.7735266  1.953236
##   0.10    5    2.465427  0.7769614  1.955106
##   0.10    6    2.802721  0.6975629  2.251406
##   0.10    7    2.979128  0.6799093  2.367295
##   0.10    8    2.902694  0.6891414  2.372789
##   0.10    9    3.148114  0.6404661  2.460830
##   0.10   10    2.983838  0.6820981  2.504837
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.
nnetPred <- predict(nnetTune, testData$x)

postResample(nnetPred, testData$y)
##      RMSE  Rsquared       MAE 
## 1.7867916 0.8738675 1.3735092
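
The MaxNWts value passed to train above comes from counting the parameters of the largest network in the grid. A short sketch of that count for a single-hidden-layer regression network follows. The helper name nnet_weight_count is ours and is not part of nnet or caret.

```r
# Sketch: parameter count for a single-hidden-layer regression network.
# nnet_weight_count is a hypothetical helper, not part of nnet or caret.
nnet_weight_count <- function(n_predictors, n_hidden) {
  hidden_weights <- n_hidden * (n_predictors + 1)  # inputs plus a bias per hidden unit
  output_weights <- n_hidden + 1                   # hidden units plus a bias for the output
  hidden_weights + output_weights
}

nnet_weight_count(10, 10)   # 121, the same as 10 * (10 + 1) + 10 + 1
nnet_weight_count(10, 3)    # 37, the size of the selected network
```

Setting MaxNWts to the count for the largest size in the grid ensures that none of the candidate networks is rejected for having too many weights.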

MARS Model

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(613)
marsTuned <- train(trainingData$x,
                   trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.322580  0.2543399  3.610030
##   1        3      3.748215  0.4252332  3.036391
##   1        4      2.645509  0.7223745  2.108078
##   1        5      2.402542  0.7679191  1.888667
##   1        6      2.378116  0.7741220  1.845874
##   1        7      1.822505  0.8667803  1.417201
##   1        8      1.706965  0.8790185  1.342523
##   1        9      1.658683  0.8840241  1.287723
##   1       10      1.623801  0.8910466  1.275374
##   1       11      1.610206  0.8913385  1.251271
##   1       12      1.627807  0.8899860  1.265812
##   1       13      1.627496  0.8904753  1.268445
##   1       14      1.642775  0.8881009  1.285885
##   1       15      1.644248  0.8879273  1.285336
##   1       16      1.644248  0.8879273  1.285336
##   1       17      1.644248  0.8879273  1.285336
##   1       18      1.644248  0.8879273  1.285336
##   1       19      1.644248  0.8879273  1.285336
##   1       20      1.644248  0.8879273  1.285336
##   1       21      1.644248  0.8879273  1.285336
##   1       22      1.644248  0.8879273  1.285336
##   1       23      1.644248  0.8879273  1.285336
##   1       24      1.644248  0.8879273  1.285336
##   1       25      1.644248  0.8879273  1.285336
##   1       26      1.644248  0.8879273  1.285336
##   1       27      1.644248  0.8879273  1.285336
##   1       28      1.644248  0.8879273  1.285336
##   1       29      1.644248  0.8879273  1.285336
##   1       30      1.644248  0.8879273  1.285336
##   1       31      1.644248  0.8879273  1.285336
##   1       32      1.644248  0.8879273  1.285336
##   1       33      1.644248  0.8879273  1.285336
##   1       34      1.644248  0.8879273  1.285336
##   1       35      1.644248  0.8879273  1.285336
##   1       36      1.644248  0.8879273  1.285336
##   1       37      1.644248  0.8879273  1.285336
##   1       38      1.644248  0.8879273  1.285336
##   2        2      4.322580  0.2543399  3.610030
##   2        3      3.748215  0.4252332  3.036391
##   2        4      2.645509  0.7223745  2.108078
##   2        5      2.410179  0.7670986  1.889079
##   2        6      2.299175  0.7890546  1.789842
##   2        7      1.816732  0.8673148  1.404578
##   2        8      1.604893  0.8983251  1.244678
##   2        9      1.470507  0.9134121  1.170182
##   2       10      1.416355  0.9184502  1.145227
##   2       11      1.330617  0.9263783  1.086287
##   2       12      1.270846  0.9339981  1.014435
##   2       13      1.290552  0.9338261  1.030310
##   2       14      1.282980  0.9325176  1.030299
##   2       15      1.259247  0.9360780  1.000936
##   2       16      1.284517  0.9332954  1.026027
##   2       17      1.269574  0.9353742  1.015216
##   2       18      1.268393  0.9354958  1.014549
##   2       19      1.268393  0.9354958  1.014549
##   2       20      1.268393  0.9354958  1.014549
##   2       21      1.268393  0.9354958  1.014549
##   2       22      1.268393  0.9354958  1.014549
##   2       23      1.268393  0.9354958  1.014549
##   2       24      1.268393  0.9354958  1.014549
##   2       25      1.268393  0.9354958  1.014549
##   2       26      1.268393  0.9354958  1.014549
##   2       27      1.268393  0.9354958  1.014549
##   2       28      1.268393  0.9354958  1.014549
##   2       29      1.268393  0.9354958  1.014549
##   2       30      1.268393  0.9354958  1.014549
##   2       31      1.268393  0.9354958  1.014549
##   2       32      1.268393  0.9354958  1.014549
##   2       33      1.268393  0.9354958  1.014549
##   2       34      1.268393  0.9354958  1.014549
##   2       35      1.268393  0.9354958  1.014549
##   2       36      1.268393  0.9354958  1.014549
##   2       37      1.268393  0.9354958  1.014549
##   2       38      1.268393  0.9354958  1.014549
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
marsPred <- predict(marsTuned, testData$x)

postResample(marsPred, testData$y)
##      RMSE  Rsquared       MAE 
## 1.1589948 0.9460418 0.9250230
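
To give some intuition for the tuning parameters, MARS builds its model from hinge functions of the form max(0, x - c). Roughly, nprune limits how many terms are kept and degree = 2 allows products of two hinges. The sketch below is a hand-rolled illustration with a known knot on simulated data. It is not the search procedure that earth performs.

```r
# Sketch of the hinge basis functions that MARS models are built from.
# This is a hand-rolled illustration, not the search procedure used by earth.
hinge <- function(x, knot) pmax(0, x - knot)

set.seed(613)
x <- runif(200)
# Response with a kink at 0.5: flat on the left, slope 4 on the right
y <- 1 + 4 * hinge(x, 0.5) + rnorm(200, sd = 0.1)

# With the knot known, the model is linear in the two mirrored hinge terms:
# hinge(x, 0.5) is max(0, x - 0.5) and hinge(-x, -0.5) is max(0, 0.5 - x)
fit <- lm(y ~ hinge(x, 0.5) + hinge(-x, -0.5))
round(coef(fit), 2)
```

The fitted coefficient on the right-hand hinge should land near 4 and the one on the left-hand hinge near 0, reflecting the shape used to simulate y.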

SVM Model

svmRTuned <- train(trainingData$x,
                   trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.499441  0.8077423  1.989067
##      0.50  2.227383  0.8291063  1.755788
##      1.00  2.032303  0.8520541  1.602510
##      2.00  1.950671  0.8626296  1.531516
##      4.00  1.873928  0.8715331  1.473022
##      8.00  1.855371  0.8729816  1.485854
##     16.00  1.851335  0.8734116  1.487928
##     32.00  1.851335  0.8734116  1.487928
##     64.00  1.851335  0.8734116  1.487928
##    128.00  1.851335  0.8734116  1.487928
##    256.00  1.851335  0.8734116  1.487928
##    512.00  1.851335  0.8734116  1.487928
##   1024.00  1.851335  0.8734116  1.487928
##   2048.00  1.851335  0.8734116  1.487928
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06613742
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06613742 and C = 16.
svmPred <- predict(svmRTuned, testData$x)

postResample(svmPred, testData$y)
##      RMSE  Rsquared       MAE 
## 2.0815349 0.8244315 1.5814080
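
The sigma reported in the tuning output controls how quickly the radial basis function kernel decays with distance. Below is a minimal sketch of that kernel, assuming the parametrization exp(-sigma * ||a - b||^2), which is how we understand kernlab defines it. The helper name rbf_kernel and the example points are ours.

```r
# Sketch of the radial basis function kernel, exp(-sigma * ||a - b||^2).
# rbf_kernel is a hypothetical helper; the parametrization is our assumption.
rbf_kernel <- function(a, b, sigma) {
  exp(-sigma * sum((a - b)^2))
}

sigma  <- 0.06613742   # value reported in the tuning output above
origin <- c(0, 0)

k_same <- rbf_kernel(origin, origin, sigma)    # identical points
k_near <- rbf_kernel(origin, c(1, 0), sigma)   # distance 1
k_far  <- rbf_kernel(origin, c(5, 0), sigma)   # distance 5

c(same = k_same, near = k_near, far = k_far)
```

Identical points have similarity 1, and the similarity shrinks toward 0 as points move apart. A larger sigma makes that decay faster.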

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

Let’s compare the evaluation metrics of each model.

rbind(knnMod = postResample(knnPred, testData$y),
      nnetMod = postResample(nnetPred, testData$y),
      marsMod = postResample(marsPred, testData$y),
      svmMod = postResample(svmPred, testData$y))
##             RMSE  Rsquared      MAE
## knnMod  3.204059 0.6819919 2.568346
## nnetMod 1.786792 0.8738675 1.373509
## marsMod 1.158995 0.9460418 0.925023
## svmMod  2.081535 0.8244315 1.581408

The MARS model has the best performance with the highest \(R^2\) of 0.95 and the lowest \(RMSE\) of 1.16.

varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.73
## X5   15.52
## X3    0.00

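The model ranks X1, X4, X2, and X5 as the most informative predictors, and none of the uninformative predictors (X6–X10) appear in the importance list. However, X3 has an overall importance of 0 in this model, even though it is one of the informative predictors.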

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

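# load data (ChemicalManufacturingProcess ships with AppliedPredictiveModeling)
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)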

# apply same imputations, data splitting, and pre-processing as done in HW7

# impute
imputations <- preProcess(ChemicalManufacturingProcess, 
               method = c("knnImpute"), 
               k=5)

chem_man_imputed <- predict(imputations, ChemicalManufacturingProcess)

# remove near-zero-variance predictors
chem_man_filtered <- chem_man_imputed[,-nearZeroVar(chem_man_imputed)]

set.seed(613)  # for reproducibility

# split into training and testing
train_indices <- sample(nrow(chem_man_filtered), nrow(chem_man_filtered) * 0.8, replace = FALSE)

trainChem <- chem_man_filtered[train_indices,]
testChem <- chem_man_filtered[-train_indices,]
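
The split logic can be checked on a stand-in data frame. In the sketch below the row count of 176 is our assumption, chosen because it is consistent with the 140 training samples reported in the model output. The data frame toy and its columns are made up, and floor() is added only to make the truncation of the non-integer sample size explicit.

```r
# Sketch of the 80/20 split on a stand-in data frame.
# n_rows = 176 is an assumption consistent with the 140 training samples reported below.
set.seed(613)
n_rows <- 176
toy <- data.frame(id = seq_len(n_rows), Yield = rnorm(n_rows))

# floor() makes the truncation of 176 * 0.8 = 140.8 explicit
train_size <- floor(n_rows * 0.8)
train_indices <- sample(n_rows, train_size, replace = FALSE)

train_toy <- toy[train_indices, ]
test_toy  <- toy[-train_indices, ]

nrow(train_toy)                                # 140
nrow(test_toy)                                 # 36
length(intersect(train_toy$id, test_toy$id))   # 0, the two sets are disjoint
```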

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

KNN Model

knnModel <- train(Yield ~ .,
                  data=trainChem,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 140, 140, 140, 140, 140, 140, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7904113  0.4030058  0.6188164
##    7  0.7660230  0.4275085  0.6036355
##    9  0.7580353  0.4398730  0.6072703
##   11  0.7544116  0.4465885  0.6074600
##   13  0.7533996  0.4477019  0.6078428
##   15  0.7529933  0.4482641  0.6103877
##   17  0.7539529  0.4495622  0.6125312
##   19  0.7541291  0.4507737  0.6123520
##   21  0.7562679  0.4490719  0.6156502
##   23  0.7589265  0.4510659  0.6185059
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
knnPred <- predict(knnModel, testChem)

Neural Network Model

trainChem_x <- trainChem |> 
  dplyr::select(-Yield)

trainChem_y <- trainChem |>
  dplyr::select(Yield)

testChem_x <- testChem |> 
  dplyr::select(-Yield)

testChem_y <- testChem |>
  dplyr::select(Yield)

corr_indices <- findCorrelation(cor(trainChem_x), cutoff = 0.7)

trainChemFiltered <- trainChem_x[, -corr_indices]
testChemFiltered <- testChem_x[, -corr_indices]

trainChemFiltered$Yield <- trainChem_y$Yield
testChemFiltered$Yield <- testChem_y$Yield

Removing the predictors flagged by findCorrelation (pairwise correlation above 0.7) leaves 34 of the 56 predictors. We can proceed with a neural network model.
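
Negative indexing such as trainChem_x[, -corr_indices] relies on the index vector being non-empty. If the index vector were empty, negating it would select no columns at all instead of keeping all of them. Here is a minimal base-R sketch of the pitfall and one possible guard. The helper name drop_columns is ours.

```r
# Sketch of a guard for negative indexing with a possibly empty index vector.
# drop_columns is a hypothetical helper name used only for illustration.
drop_columns <- function(df, idx) {
  if (length(idx) == 0) {
    return(df)  # nothing to remove
  }
  df[, -idx, drop = FALSE]
}

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

ncol(df[, -integer(0)])              # 0: every column is dropped
ncol(drop_columns(df, integer(0)))   # 3: data frame returned unchanged
ncol(drop_columns(df, c(1, 3)))      # 1: only column b remains
```

The same consideration applies to the nearZeroVar filtering step earlier, which uses the same negative-indexing pattern.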

nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), 
                        .size = c(1:10))

set.seed(613)
nnetTune <- train(Yield ~ .,
                  data=trainChemFiltered,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"),
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainChemFiltered)) + 10 + 1,
                  maxit = 500)

nnetTune
## Neural Network 
## 
## 140 samples
##  34 predictor
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 127, 126, 124, 126, 126, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00    1    0.9023743  0.3606740  0.7138930
##   0.00    2    1.1531904  0.3055392  0.8629739
##   0.00    3    1.1092196  0.3408580  0.8634495
##   0.00    4    1.3169755  0.2837269  1.0059900
##   0.00    5    1.3545693  0.2307622  1.1082244
##   0.00    6    1.1423440  0.1982726  0.9209122
##   0.00    7    1.0979439  0.3388930  0.9254208
##   0.00    8    1.0222017  0.3915784  0.8079884
##   0.00    9    1.1264186  0.3072208  0.9113648
##   0.00   10    1.0145311  0.3390948  0.7789101
##   0.01    1    0.8989321  0.3783182  0.7167324
##   0.01    2    0.9999304  0.3839547  0.7940897
##   0.01    3    1.2912675  0.3080647  1.0005925
##   0.01    4    1.1982446  0.2797407  0.9193959
##   0.01    5    1.0552648  0.3356105  0.8564671
##   0.01    6    1.0255680  0.3534374  0.8273259
##   0.01    7    0.9132168  0.4486564  0.7156832
##   0.01    8    0.7902489  0.5151836  0.6463621
##   0.01    9    0.7863965  0.4788332  0.6458413
##   0.01   10    0.8033223  0.5436836  0.6398243
##   0.10    1    0.7422871  0.5569085  0.5959207
##   0.10    2    0.8681757  0.4567303  0.7085806
##   0.10    3    1.0457898  0.3434175  0.8517325
##   0.10    4    1.0309420  0.3171089  0.8004476
##   0.10    5    0.8977346  0.4462025  0.6993282
##   0.10    6    0.7829058  0.5234373  0.6300609
##   0.10    7    0.7962628  0.5722234  0.6048363
##   0.10    8    0.8089790  0.4854038  0.6471713
##   0.10    9    0.8618077  0.4806067  0.6683053
##   0.10   10    0.7789757  0.5558081  0.6239207
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.1.
nnetPred <- predict(nnetTune, testChemFiltered)

MARS Model

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(613)
marsTuned <- train(Yield ~ .,
                   data=trainChem,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = trainControl(method = "cv"))

marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 140 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 127, 126, 124, 126, 126, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        2      0.7640112  0.4730040  0.6054886
##   1        3      0.6905839  0.5703117  0.5425640
##   1        4      0.6201914  0.6456619  0.4999825
##   1        5      0.6174550  0.6412863  0.5041031
##   1        6      0.6268730  0.6354610  0.5104723
##   1        7      0.6263290  0.6324963  0.4989311
##   1        8      0.6251756  0.6350417  0.4877998
##   1        9      0.6511419  0.6181760  0.5100114
##   1       10      0.6611002  0.6074284  0.5190580
##   1       11      0.6716547  0.5968301  0.5270800
##   1       12      0.6755107  0.5993407  0.5308035
##   1       13      0.6668321  0.6071475  0.5254225
##   1       14      0.6732291  0.6034138  0.5313733
##   1       15      0.6789386  0.6001463  0.5390155
##   1       16      0.6789386  0.6001463  0.5390155
##   1       17      0.6849114  0.5955717  0.5420275
##   1       18      0.6950174  0.5944802  0.5471197
##   1       19      0.6947070  0.5948302  0.5460859
##   1       20      0.6981732  0.5928353  0.5467338
##   1       21      0.6981732  0.5928353  0.5467338
##   1       22      0.6981732  0.5928353  0.5467338
##   1       23      0.6981732  0.5928353  0.5467338
##   1       24      0.6981732  0.5928353  0.5467338
##   1       25      0.6981732  0.5928353  0.5467338
##   1       26      0.6981732  0.5928353  0.5467338
##   1       27      0.6981732  0.5928353  0.5467338
##   1       28      0.6981732  0.5928353  0.5467338
##   1       29      0.6981732  0.5928353  0.5467338
##   1       30      0.6981732  0.5928353  0.5467338
##   1       31      0.6981732  0.5928353  0.5467338
##   1       32      0.6981732  0.5928353  0.5467338
##   1       33      0.6981732  0.5928353  0.5467338
##   1       34      0.6981732  0.5928353  0.5467338
##   1       35      0.6981732  0.5928353  0.5467338
##   1       36      0.6981732  0.5928353  0.5467338
##   1       37      0.6981732  0.5928353  0.5467338
##   1       38      0.6981732  0.5928353  0.5467338
##   2        2      0.7640112  0.4730040  0.6054886
##   2        3      0.7168840  0.5486186  0.5754676
##   2        4      0.6937957  0.5788373  0.5547862
##   2        5      0.7078381  0.5342889  0.5516909
##   2        6      0.6893286  0.5506447  0.5411029
##   2        7      0.7157568  0.5461100  0.5584418
##   2        8      0.7282620  0.5433741  0.5625683
##   2        9      0.7114619  0.5600860  0.5482608
##   2       10      0.7531098  0.5073165  0.5671176
##   2       11      0.7146329  0.5436995  0.5557421
##   2       12      0.7380431  0.5283689  0.5672111
##   2       13      0.7394269  0.5240770  0.5733455
##   2       14      0.7603234  0.5134955  0.5860571
##   2       15      0.7720708  0.4953263  0.5934611
##   2       16      0.7878424  0.4858962  0.6061875
##   2       17      0.8242476  0.4652130  0.6061352
##   2       18      0.8271907  0.4735686  0.6047289
##   2       19      0.8227314  0.4749919  0.6031580
##   2       20      0.8153485  0.4769102  0.5968650
##   2       21      0.8085394  0.4778036  0.5971506
##   2       22      0.8176210  0.4775053  0.6051469
##   2       23      0.8089366  0.4843386  0.6014091
##   2       24      0.8097640  0.4780909  0.5999898
##   2       25      0.8157439  0.4685095  0.6068908
##   2       26      0.8176699  0.4659336  0.6097518
##   2       27      0.8176699  0.4659336  0.6097518
##   2       28      0.8189274  0.4621674  0.6103464
##   2       29      0.8172696  0.4638202  0.6084594
##   2       30      0.8172696  0.4638202  0.6084594
##   2       31      0.8172696  0.4638202  0.6084594
##   2       32      0.8172696  0.4638202  0.6084594
##   2       33      0.8172696  0.4638202  0.6084594
##   2       34      0.8172696  0.4638202  0.6084594
##   2       35      0.8172696  0.4638202  0.6084594
##   2       36      0.8172696  0.4638202  0.6084594
##   2       37      0.8172696  0.4638202  0.6084594
##   2       38      0.8172696  0.4638202  0.6084594
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
marsPred <- predict(marsTuned, testChem)

SVM Model

svmRTuned <- train(Yield ~ .,
                   data=trainChem,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))

svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 126, 124, 128, 126, 127, 125, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE       Rsquared   MAE      
##      0.25  0.7283012  0.5592173  0.5939106
##      0.50  0.6633791  0.6180803  0.5369524
##      1.00  0.6191929  0.6697225  0.5000055
##      2.00  0.5787403  0.7120096  0.4638110
##      4.00  0.5637067  0.7157505  0.4448834
##      8.00  0.5488654  0.7283117  0.4395413
##     16.00  0.5480276  0.7290105  0.4391489
##     32.00  0.5480276  0.7290105  0.4391489
##     64.00  0.5480276  0.7290105  0.4391489
##    128.00  0.5480276  0.7290105  0.4391489
##    256.00  0.5480276  0.7290105  0.4391489
##    512.00  0.5480276  0.7290105  0.4391489
##   1024.00  0.5480276  0.7290105  0.4391489
##   2048.00  0.5480276  0.7290105  0.4391489
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01414407
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01414407 and C = 16.
svmPred <- predict(svmRTuned, testChem)

Let’s compare the test set performance metrics for each model.

rbind(knnMod = postResample(knnPred, testChem$Yield),
      nnetMod = postResample(nnetPred, testChemFiltered$Yield),
      marsMod = postResample(marsPred, testChem$Yield),
      svmMod = postResample(svmPred, testChem$Yield))
##              RMSE  Rsquared       MAE
## knnMod  0.8357339 0.1900885 0.6474526
## nnetMod 0.6769329 0.4697276 0.5464301
## marsMod 0.6301207 0.5449365 0.4941829
## svmMod  0.5850169 0.5842473 0.4863866

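The SVM model has the highest test set \(R^2\) (0.58) and the lowest test set \(RMSE\) (0.59). It also has the lowest resampled \(RMSE\) (0.55, compared with 0.62 for MARS, 0.74 for the neural network, and 0.75 for KNN), so it is the best performing model on both the resampling and the test set results.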
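
To line up the resampling and test set results side by side, the RMSE values can be collected into one table. The sketch below copies the numbers by hand from the outputs above: the resampled value is the RMSE of each model's selected tuning parameters. The data frame and column names are our own.

```r
# Resampled and test set RMSE copied from the outputs above.
results <- data.frame(
  model          = c("knn", "nnet", "mars", "svm"),
  resampled_rmse = c(0.7529933, 0.7422871, 0.6174550, 0.5480276),
  test_rmse      = c(0.8357339, 0.6769329, 0.6301207, 0.5850169),
  stringsAsFactors = FALSE
)

# Order from best to worst by test set RMSE
results[order(results$test_rmse), ]

best_resampled <- results$model[which.min(results$resampled_rmse)]
best_test      <- results$model[which.min(results$test_rmse)]
c(resampled = best_resampled, test = best_test)
```

One caveat: the KNN model was resampled with 25 bootstrap repetitions while the other three used 10-fold cross-validation, so the resampled values are not strictly comparable across all four models.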

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Let’s take a look at the top 10 most important predictors for the SVM model.

plot(varImp(svmRTuned), top = 10)

The process variables dominate the list of most important predictors. However, more biological variables rank as important in the SVM model than in the linear model from HW7.

lasso_mod <- train(Yield ~ .,
                   data=trainChem,
                   method = "glmnet",
                   preProcess = c("center", "scale"),
                   trControl = trainControl(method = "cv"),
                   tuneGrid = expand.grid(.alpha = 1, .lambda = seq(0, 1, 0.05)))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(varImp(lasso_mod), top = 10)

Only seven variables were actually important in the linear model trained for HW7. ManufacturingProcess32 is the most important variable in both models.

Other predictors that appear as important variables in both models are ManufacturingProcess13, ManufacturingProcess09, ManufacturingProcess17, ManufacturingProcess36, ManufacturingProcess06, and BiologicalMaterial02.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

library(corrplot)

chem_man_filtered[,c("Yield", "BiologicalMaterial06", "ManufacturingProcess31", "BiologicalMaterial03", "BiologicalMaterial12")] |>
  cor() |>
  corrplot(method="color",
           diag=FALSE,
           type="lower",
           addCoef.col = "black",
           number.cex=0.5)

Looking down the first column, BiologicalMaterial06 has the highest positive correlation with Yield, followed by BiologicalMaterial03 and BiologicalMaterial12. ManufacturingProcess31 has only a slight negative correlation with Yield, so it is unclear why it should be such an important variable in the optimal model. Further analysis would be needed to see whether this variable is highly correlated with other variables in the data set.
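
One way to follow up on that last point is to rank the other predictors by their absolute correlation with the variable in question. The sketch below uses a simulated stand-in data frame with made-up column names (ProcA, ProcB, BioA). In the actual analysis, toy would be replaced by chem_man_filtered and target by "ManufacturingProcess31".

```r
# Sketch: which other predictors track a given predictor most closely?
# The data frame and column names are simulated stand-ins, not the chemical data.
set.seed(613)
n <- 150
p1 <- rnorm(n)
toy <- data.frame(
  Yield = 0.5 * p1 + rnorm(n),
  ProcA = p1,
  ProcB = p1 + rnorm(n, sd = 0.2),  # built to be strongly correlated with ProcA
  BioA  = rnorm(n)                  # generated independently of ProcA
)

target <- "ProcA"
others <- setdiff(names(toy), c("Yield", target))
cors <- sapply(others, function(v) cor(toy[[target]], toy[[v]]))

# Sort by absolute correlation, strongest first
cors[order(abs(cors), decreasing = TRUE)]
```

A predictor that is only weakly correlated with the response but strongly correlated with other important predictors may owe part of its importance ranking to that overlap.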