Applied Predictive Modeling (Kuhn & Johnson)

Non-Linear Regression Models

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]

where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
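The simulation and the k-NN fit below can be reproduced with the code given in the exercise text; the methods noted in the comment for the remaining fits are assumptions consistent with the output headers that follow:

library(mlbench)
library(caret)

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)  # caret prefers data frames

# large test set for estimating the true error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# k-NN fit shown first below; the neural network, MARS and SVM fits
# presumably use method = "avNNet", "earth" and "svmRadial" with the
# same bootstrap resampling scheme
knnModel <- train(x = trainingData$x, y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel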

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.633298  0.7311434  2.062484
##   0.00    5    3.285459  0.6470466  2.379608
##   0.00   10    2.903921  0.6844344  2.274391
##   0.01    1    2.597128  0.7419913  2.019047
##   0.01    5    2.536430  0.7544127  1.999224
##   0.01   10    2.753848  0.7126363  2.175934
##   0.10    1    2.595957  0.7411858  2.013476
##   0.10    5    2.485368  0.7622012  1.972273
##   0.10   10    2.513428  0.7562287  1.989455
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag
##  = FALSE.
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.459160  0.2269009  3.648621
##   1        3      3.703808  0.4627759  2.998624
##   1        4      2.788330  0.6952911  2.247361
##   1        5      2.552061  0.7438976  2.038840
##   1        6      2.398227  0.7737511  1.917525
##   1        7      1.956515  0.8489167  1.536833
##   1        8      1.859780  0.8631867  1.447584
##   1        9      1.768654  0.8765206  1.374287
##   1       10      1.764931  0.8775507  1.356467
##   1       11      1.779741  0.8766418  1.376192
##   1       12      1.774808  0.8772249  1.372218
##   1       13      1.805089  0.8726829  1.397210
##   1       14      1.819615  0.8711360  1.406696
##   1       15      1.835221  0.8695754  1.416871
##   1       16      1.840524  0.8687417  1.422030
##   1       17      1.842401  0.8683960  1.425353
##   1       18      1.842401  0.8683960  1.425353
##   1       19      1.842401  0.8683960  1.425353
##   1       20      1.842401  0.8683960  1.425353
##   2        2      4.471734  0.2244806  3.647081
##   2        3      3.714218  0.4599572  3.004844
##   2        4      2.861317  0.6777013  2.315312
##   2        5      2.553105  0.7439500  2.051875
##   2        6      2.446188  0.7645488  1.949440
##   2        7      2.053872  0.8319061  1.614748
##   2        8      1.861883  0.8626461  1.455725
##   2        9      1.730611  0.8802498  1.353077
##   2       10      1.600061  0.8971990  1.254381
##   2       11      1.511413  0.9084055  1.189547
##   2       12      1.542350  0.9052716  1.187687
##   2       13      1.509975  0.9103798  1.163212
##   2       14      1.467450  0.9149851  1.139031
##   2       15      1.475360  0.9139041  1.147567
##   2       16      1.490228  0.9115933  1.145509
##   2       17      1.492258  0.9109130  1.142455
##   2       18      1.490088  0.9110852  1.139128
##   2       19      1.489183  0.9112096  1.138835
##   2       20      1.489997  0.9110402  1.139272
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.598750  0.7792058  2.072381
##     0.50  2.377193  0.7917602  1.885072
##     1.00  2.238917  0.8081062  1.765140
##     2.00  2.168225  0.8184332  1.700831
##     4.00  2.136811  0.8225716  1.669194
##     8.00  2.132541  0.8229563  1.666091
##    16.00  2.133316  0.8228263  1.666654
##    32.00  2.133316  0.8228263  1.666654
##    64.00  2.133316  0.8228263  1.666654
##   128.00  2.133316  0.8228263  1.666654
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06773352
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06773352 and C = 8.
          KNN     NNET     MARS      SVM
RMSE 3.175066 2.170368 1.277999 2.075607
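A sketch of how the test-set comparison above and the importance table below could be produced; the fit object names other than knnModel are assumptions:

# test-set RMSE for each model on the 5,000-sample test set
models <- list(KNN = knnModel, NNET = nnetModel,
               MARS = marsModel, SVM = svmModel)
sapply(models, function(fit) {
  postResample(predict(fit, newdata = testData$x), testData$y)["RMSE"]
})

# predictor importance for the MARS model
varImp(marsModel)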
## earth variable importance
## 
##     Overall
## X1   100.00
## X4    84.98
## X2    68.87
## X5    48.55
## X3    38.96
## X7     0.00
## X9     0.00
## X8     0.00
## X10    0.00
## X6     0.00

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

The MARS model shows the best test-set performance among the four algorithms, with an RMSE of 1.277999. The variable importance output confirms that it selected the informative predictors X1-X5; the non-informative predictors X6-X10 all receive zero importance.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process.

(A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.)

Use the same data imputation, data splitting and pre-processing steps as before and train several nonlinear regression models.
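A plausible reconstruction of those steps, assuming kNN imputation, near-zero-variance and correlation filtering, and a stratified 70/30 split; the seed, correlation cutoff and split ratio are assumptions chosen to be consistent with the 124-sample, 47-predictor training set reported below:

library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)
predictors <- ChemicalManufacturingProcess[, -1]  # column 1 is Yield
yield      <- ChemicalManufacturingProcess$Yield

# kNN imputation; preProcess centers and scales as part of knnImpute,
# which is why most train() calls below report "No pre-processing"
predictors <- predict(preProcess(predictors, method = "knnImpute"), predictors)

# drop near-zero-variance and highly correlated predictors
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) predictors <- predictors[, -nzv]
hiCor <- findCorrelation(cor(predictors), cutoff = 0.9)
if (length(hiCor) > 0) predictors <- predictors[, -hiCor]

# 70/30 train/test split stratified on the response
set.seed(1)
inTrain <- createDataPartition(yield, p = 0.7, list = FALSE)
trainX <- predictors[inTrain, ];  trainY <- yield[inTrain]
testX  <- predictors[-inTrain, ]; testY  <- yield[-inTrain]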

Modeling & Validation
Linear Model (PLS) - For Reference
## Partial Least Squares 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.351010  0.4485164  1.102844
##    2     1.260323  0.5154151  1.016047
##    3     1.273289  0.5153561  1.028123
##    4     1.309761  0.4989806  1.059869
##    5     1.347540  0.4830462  1.084036
##    6     1.382643  0.4660665  1.108433
##    7     1.426869  0.4437381  1.147465
##    8     1.466832  0.4273872  1.179911
##    9     1.522083  0.4010375  1.228305
##   10     1.569062  0.3827168  1.264325
##   11     1.621665  0.3653264  1.296599
##   12     1.659835  0.3524162  1.318829
##   13     1.703314  0.3384623  1.343085
##   14     1.744742  0.3275454  1.364165
##   15     1.794207  0.3150951  1.391282
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.

Nonlinear Models (KNN, Neural Networks, MARS & SVM)
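The four nonlinear fits could be trained along these lines; the object names and tuning grids are assumptions chosen to match the parameter ranges printed below:

set.seed(1)
knnFit  <- train(trainX, trainY, method = "knn", tuneLength = 10)

nnetFit <- train(trainX, trainY, method = "avNNet",
                 tuneGrid = expand.grid(decay = c(0, 0.01, 0.1),
                                        size = c(1, 5, 10), bag = FALSE),
                 linout = TRUE, trace = FALSE)

marsFit <- train(trainX, trainY, method = "earth",
                 tuneGrid = expand.grid(degree = 1:2, nprune = 2:10))

svmFit  <- train(trainX, trainY, method = "svmRadial",
                 preProc = c("center", "scale"), tuneLength = 10)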
## k-Nearest Neighbors 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.402749  0.4048961  1.100072
##    7  1.381977  0.4197026  1.104513
##    9  1.382696  0.4216603  1.117713
##   11  1.377755  0.4270218  1.118546
##   13  1.392717  0.4211626  1.132558
##   15  1.393130  0.4275926  1.131653
##   17  1.399322  0.4260907  1.142817
##   19  1.409851  0.4216393  1.156186
##   21  1.418607  0.4178667  1.168636
##   23  1.420105  0.4270186  1.170369
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## Model Averaged Neural Network 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.610693  0.3282347  1.317989
##   0.00    5    2.096759  0.2857749  1.662703
##   0.00   10    7.453411  0.1163407  4.777774
##   0.01    1    1.632921  0.3537768  1.287717
##   0.01    5    2.382653  0.2224624  1.692331
##   0.01   10    2.164522  0.3030359  1.667056
##   0.10    1    1.935140  0.3037360  1.474595
##   0.10    5    2.680296  0.2035901  1.783633
##   0.10   10    2.267248  0.2250593  1.679761
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0 and bag
##  = FALSE.
## Multivariate Adaptive Regression Spline 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      1.474974  0.3937771  1.183436
##   1        3      1.347717  0.4944946  1.077682
##   1        4      1.358573  0.4855082  1.088398
##   1        5      1.352871  0.4895013  1.089438
##   1        6      1.464143  0.4679334  1.121022
##   1        7      1.608554  0.4405143  1.168835
##   1        8      1.597393  0.4333546  1.168652
##   1        9      1.673360  0.4178674  1.197718
##   1       10      1.744609  0.4141018  1.224317
##   2        2      1.476052  0.3932149  1.184854
##   2        3      1.404857  0.4529665  1.125711
##   2        4      1.378437  0.4700282  1.096039
##   2        5      1.836718  0.4300026  1.201290
##   2        6      1.885691  0.4048881  1.209516
##   2        7      1.885759  0.3943905  1.221275
##   2        8      1.866741  0.3979609  1.216667
##   2        9      1.609065  0.4016889  1.175496
##   2       10      1.695258  0.3931507  1.214503
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 124 samples
##  47 predictor
## 
## Pre-processing: centered (47), scaled (47) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.403416  0.4964341  1.1574251
##     0.50  1.311130  0.5279193  1.0770035
##     1.00  1.265853  0.5427493  1.0208705
##     2.00  1.259414  0.5441863  0.9918689
##     4.00  1.268521  0.5390152  0.9855121
##     8.00  1.271589  0.5359192  0.9834915
##    16.00  1.271569  0.5358966  0.9835233
##    32.00  1.271569  0.5358966  0.9835233
##    64.00  1.271569  0.5358966  0.9835233
##   128.00  1.271569  0.5358966  0.9835233
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0125532
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0125532 and C = 2.
  1. Which nonlinear regression model gives the optimal resampling and test set performance?

Resampling Performance
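The comparison below collects the bootstrap resampling results from the four fits (the object name resampl comes from the output itself; the fit names follow the sketch above):

resampl <- resamples(list(KNN = knnFit, NNet = nnetFit,
                          MARS = marsFit, SVM = svmFit))
summary(resampl)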

## 
## Call:
## summary.resamples(object = resampl)
## 
## Models: KNN, NNet, MARS, SVM 
## Number of resamples: 25 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## KNN  0.9058592 0.9870896 1.1127841 1.1185456 1.203046 1.558048    0
## NNet 1.0495308 1.2273437 1.3294560 1.3179887 1.385199 1.661394    0
## MARS 0.8684310 0.9876964 1.0447313 1.0776815 1.118345 1.522757    0
## SVM  0.7658697 0.9145722 0.9718791 0.9918689 1.070424 1.197459    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## KNN  1.178216 1.234169 1.354816 1.377755 1.463125 1.929067    0
## NNet 1.285416 1.487737 1.629464 1.610693 1.747954 1.935362    0
## MARS 1.116609 1.234721 1.328910 1.347717 1.441420 1.774211    0
## SVM  1.056796 1.161601 1.216338 1.259414 1.352619 1.538968    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## KNN  0.2201409 0.3727362 0.4039894 0.4270218 0.4947506 0.5988764    0
## NNet 0.1031603 0.2770266 0.3469157 0.3282347 0.4008970 0.5310045    0
## MARS 0.3043747 0.4469323 0.4782430 0.4944946 0.5626903 0.6338061    0
## SVM  0.4526292 0.5040173 0.5460373 0.5441863 0.5853110 0.6747305    0
The SVM model gives the best resampling performance across all three metrics (MAE, RMSE and Rsquared), followed closely by MARS.

Test Set Performance

               KNN      NNET      MARS       SVM       PLS
RMSE     1.5946023 1.7669049 1.5386158 1.2414576 1.3105629
Rsquared 0.3451473 0.2782982 0.3755943 0.6073041 0.5551553
MAE      1.3030944 1.4193041 1.1283177 0.9816494 1.0267944
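The table above can be generated in one step, reusing the fitted objects (plsFit is an assumed name for the reference linear model):

# RMSE, Rsquared and MAE on the held-out test set for all five models
fits <- list(KNN = knnFit, NNET = nnetFit, MARS = marsFit,
             SVM = svmFit, PLS = plsFit)
sapply(fits, function(fit) postResample(predict(fit, testX), testY))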
On the test set, the SVM model is again the best performer across all the performance metrics (RMSE, Rsquared and MAE). Another important point to mention is that the linear model (PLS) actually comes second overall, ahead of MARS.
  2. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0125531978064151 
## 
## Number of Support Vectors : 113 
## 
## Objective Function Value : -60.9605 
## Training error : 0.106214
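The importance rankings discussed below could be extracted as follows; note that for SVMs caret falls back to a model-free importance measure, and plsFit is the assumed name of the linear fit:

plot(varImp(svmFit), top = 10)  # top 10 for the nonlinear (SVM) model
plot(varImp(plsFit), top = 10)  # top 10 for the linear (PLS) model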

The most important predictors for the optimal nonlinear model (SVM) are primarily ManufacturingProcess variables (7 of the top 10). Interestingly, one of the three BiologicalMaterial predictors in the list ranks second overall. The optimal linear model (PLS) shows the same proportion of ManufacturingProcess predictors in its top 10 (7 of 10), but there the dominance of the process variables over the BiologicalMaterial ones is more clearly defined.
  3. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

There is a large overlap in the top 20 predictors between the nonlinear and linear models; only four predictors are unique to the nonlinear model, all of them ManufacturingProcess variables. Most of them do not reveal an evident relationship with the Yield response, which is expected given that the SVM radial kernel maps the training data into a higher-dimensional space before regressing the corresponding response values.
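One way to visualize these relationships is sketched here; the top-20 lists are derived from the importance objects, so the code carries the same assumptions about fit names as above:

# predictors in the SVM top 20 that are absent from the PLS top 20
svmImp <- varImp(svmFit)$importance
plsImp <- varImp(plsFit)$importance
svmTop <- rownames(svmImp)[order(-svmImp$Overall)][1:20]
plsTop <- rownames(plsImp)[order(-plsImp$Overall)][1:20]
uniqueTop <- setdiff(svmTop, plsTop)

# scatter plots of each unique predictor against Yield, with a smoother
featurePlot(x = trainX[, uniqueTop], y = trainY,
            plot = "scatter", type = c("p", "smooth"))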