7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (five other non-informative variables are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data.
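As a rough illustration of what the simulation does, the equation can be written out in base R. This is only a sketch: `simulate_friedman1` is a hypothetical helper, not the mlbench function, and the seed, sample sizes and noise level are assumptions rather than values stated in the exercise. With mlbench installed, `mlbench.friedman1(n, sd)` plays the same role.

```r
# Hypothetical helper: base-R version of the Friedman benchmark equation above.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten uniform predictors; only X1-X5 enter the response
  x <- matrix(runif(n * 10), nrow = n, ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5] +
    rnorm(n, mean = 0, sd = sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(200)                                      # assumed seed
trainingData <- simulate_friedman1(200, sd = 1)    # assumed training size
testData <- simulate_friedman1(5000, sd = 1)       # assumed test size
dim(trainingData$x)
```

A large simulated test set gives a precise estimate of the true error rate, which is why the test set here is much larger than the training set.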

Tune several models on these data.

KNN Model

Let’s start with the K-Nearest Neighbors (KNN) model, which predicts the response for a new sample from its k nearest neighbors (in Euclidean distance) in the training set. Because this is a regression problem, the prediction is the average of the neighbors’ responses rather than a majority vote.
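The averaging step can be sketched in a few lines of base R. The function name `knn_regress` and the toy data are made up for illustration; this is not the implementation used to produce the results in this report. Because the distances depend on the scale of each predictor, the predictors would normally be centered and scaled first.

```r
# Hypothetical helper: predict one new sample by averaging its k nearest neighbors.
knn_regress <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  mean(train_y[nearest])
}

# Toy data: one predictor, four training samples
train_x <- matrix(c(0, 1, 2, 10), ncol = 1)
train_y <- c(1, 2, 3, 50)

# The three nearest neighbors of 1.2 are x = 1, 2 and 0, so the prediction is mean(2, 3, 1) = 2
knn_regress(train_x, train_y, new_x = 1.2, k = 3)
```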

##   k
## 3 9
##    rsquared    rmse
## 1 0.6701072 3.07972

The number of neighbors that gave the smallest resampled root mean squared error is k = 9, with RMSE = 3.08 and R2 = 0.67. The top 5 informative predictors are X4, X1, X2, X5 and X3. The performance of this model on the test set is shown below.

##      RMSE  Rsquared       MAE 
## 3.1172319 0.6556622 2.4899907
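The three metrics reported throughout can be computed by hand. The sketch below uses a hypothetical helper `perf` and made-up numbers; it assumes R2 is taken as the squared correlation between predicted and observed values, which is how caret's postResample reports it by default.

```r
# Hypothetical helper mirroring the RMSE / Rsquared / MAE summary shown above.
perf <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,      # assumption: squared correlation
    MAE = mean(abs(pred - obs)))
}

obs <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)
perf(pred, obs)   # every error is 0.5 in absolute value, so RMSE = MAE = 0.5
```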

SVM model

##        sigma C
## 6 0.06509124 8
##    rsquared     rmse
## 1 0.8547582 1.918711

RMSE is used to select the optimal model using the smallest value. The cost value for the SVM model that results in the smallest resampled RMSE is C = 8. The tuning parameter ‘sigma’ was held constant at a value of 0.065. The model has RMSE = 1.92 and R2 = 0.855; of the cost values tried, this one gives the smallest error. The top 5 informative predictors are X4, X1, X2, X5 and X3.
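To make the role of sigma concrete, here is a small sketch of a radial basis function kernel, assuming the SVM uses the Gaussian form exp(-sigma * ||x - x'||^2) (the parameterization used by kernlab). The helper name `rbf_kernel` and the input points are illustrative only.

```r
# Hypothetical helper: Gaussian (radial basis) kernel between two samples.
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

rbf_kernel(c(0, 0), c(0, 0), sigma = 0.065)   # identical points: similarity of 1
rbf_kernel(c(0, 0), c(3, 4), sigma = 0.065)   # squared distance 25: exp(-0.065 * 25)
```

A larger sigma makes the similarity fall off faster with distance, while the cost C controls how heavily large residuals are penalized.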


MARS Model

##    nprune degree
## 43     15      2
##    rsquared     rmse
## 1 0.8841742 1.639073

RMSE was used to select the optimal model using the smallest value. The tuning parameters for the MARS model that resulted in the smallest resampled RMSE are degree = 2 (two-way interactions allowed) and nprune = 15 retained terms. The model has RMSE = 1.64 and R2 = 0.884; of the parameter combinations tried, this one gives the smallest error. The top 5 informative predictors are X1, X4, X2, X5 and X3.
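MARS builds its model from hinge functions, and a degree of 2 allows products of two hinges. The sketch below shows what such terms look like; the knot locations (0.4 and 0.7) and the helper names are invented for illustration and are not taken from the fitted model.

```r
# Hinge functions: zero on one side of the knot, linear on the other.
hinge <- function(x, knot) pmax(0, x - knot)
mirror_hinge <- function(x, knot) pmax(0, knot - x)

x1 <- c(0.2, 0.5, 0.9)
x2 <- c(0.1, 0.6, 0.8)

h1 <- hinge(x1, 0.4)          # 0.0 0.1 0.5
h2 <- mirror_hinge(x2, 0.7)   # 0.6 0.1 0.0

# A degree-2 term is the product of two hinges on different predictors
h1 * h2
```

The nprune parameter caps how many such terms survive the backward pruning pass.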

Neural Net Model

##    size decay   bag
## 25    5   0.1 FALSE
##    rsquared     rmse
## 1 0.5644107 4.484215

RMSE is used to select the optimal model using the smallest value. The tuning parameters for the NNET model that result in the smallest resampled RMSE are 5 units in the hidden layer (size = 5) and a weight decay of 0.1, the regularization parameter that guards against over-fitting, with bagging turned off (bag = FALSE). The model has RMSE = 4.48 and R2 = 0.564; of the parameter combinations tried, this one gives the smallest error. The top 5 informative predictors are X4, X1, X2, X5 and X3.

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

##          RMSE  Rsquared      MAE
## KNN  3.117232 0.6556622 2.489991
## SVM  2.063191 0.8275736 1.566221
## MARS 1.158995 0.9460418 0.925023
## NNET 2.111396 0.8277556 1.573901

The results above suggest that the MARS model gives the best performance: on the test data it has the smallest error (RMSE = 1.16) and the largest R2 (0.946) of the four models. MARS also selects the informative predictors, since its top 5 predictors are X1-X5. The Multivariate Adaptive Regression Splines model therefore predicts these data better than the K-Nearest Neighbors, Support Vector Machine and Neural Network models.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

variable vars n mean sd median trimmed mad min max range skew kurtosis se
Yield 1 176 40.1765341 1.8456664 39.970 40.1150000 1.9718580 35.250 46.340 11.090 0.3109596 -0.1132944 0.1391223
BiologicalMaterial01 2 176 6.4114205 0.7139225 6.305 6.3933803 0.6745830 4.580 8.810 4.230 0.2733165 0.4567758 0.0538139
BiologicalMaterial02 3 176 55.6887500 4.0345806 55.090 55.5810563 4.5812340 46.870 64.750 17.880 0.2441269 -0.7050911 0.3041180
BiologicalMaterial03 4 176 67.7050000 4.0010641 67.220 67.6780986 4.2773010 56.970 78.250 21.280 0.0285108 -0.1235203 0.3015916
BiologicalMaterial04 5 176 12.3492614 1.7746607 12.100 12.1860563 1.3714050 9.380 23.090 13.710 1.7323153 7.0564614 0.1337701
BiologicalMaterial05 6 176 18.5986364 1.8441408 18.490 18.5488732 1.8829020 13.240 24.850 11.610 0.3040053 0.2198005 0.1390073
BiologicalMaterial06 7 176 48.9103977 3.7460718 48.460 48.7379577 3.9437160 40.600 59.380 18.780 0.3685344 -0.3654933 0.2823708
BiologicalMaterial07 8 176 100.0141477 0.1077423 100.000 100.0000000 0.0000000 100.000 100.830 0.830 7.3986642 53.0417012 0.0081214
BiologicalMaterial08 9 176 17.4947727 0.6769536 17.510 17.4687324 0.5930400 15.880 19.140 3.260 0.2200539 0.0627721 0.0510273
BiologicalMaterial09 10 176 12.8500568 0.4151757 12.835 12.8635211 0.4225410 11.440 14.080 2.640 -0.2684177 0.2927765 0.0312950
BiologicalMaterial10 11 176 2.8006250 0.5991433 2.710 2.7328873 0.4003020 1.770 6.870 5.100 2.4023783 11.6471845 0.0451621
BiologicalMaterial11 12 176 146.9531818 4.8204704 146.080 146.7863380 4.1142150 135.810 158.730 22.920 0.3588211 0.0162456 0.3633566
BiologicalMaterial12 13 176 20.1998864 0.7735440 20.120 20.1776056 0.6671700 18.350 22.210 3.860 0.3038443 0.0146595 0.0583081
ManufacturingProcess01 14 175 11.2074286 1.8224342 11.400 11.4078014 1.0378200 0.000 14.100 14.100 -3.9201855 21.8688069 0.1377631
ManufacturingProcess02 15 173 16.6826590 8.4715694 21.000 18.0575540 1.4826000 0.000 22.500 22.500 -1.4307675 0.1062466 0.6440815
ManufacturingProcess03 16 161 1.5395652 0.0223983 1.540 1.5410078 0.0148260 1.470 1.600 0.130 -0.4799447 1.7280557 0.0017652
ManufacturingProcess04 17 175 931.8514286 6.2744406 934.000 932.2836879 5.9304000 911.000 946.000 35.000 -0.6979357 0.0631282 0.4743031
ManufacturingProcess05 18 175 1001.6931429 30.5272134 999.200 998.6248227 17.3464200 923.000 1175.300 252.300 2.5872769 11.7446904 2.3076404
ManufacturingProcess06 19 174 207.4017241 2.6993999 206.800 207.0928571 1.9273800 203.000 227.400 24.400 3.0419007 17.3764864 0.2046410
ManufacturingProcess07 20 175 177.4800000 0.5010334 177.000 177.4751773 0.0000000 177.000 178.000 1.000 0.0793788 -2.0050587 0.0378746
ManufacturingProcess08 21 175 177.5542857 0.4984706 178.000 177.5673759 0.0000000 177.000 178.000 1.000 -0.2165645 -1.9642262 0.0376808
ManufacturingProcess09 22 176 45.6601136 1.5464407 45.730 45.7188732 1.2157320 38.890 49.360 10.470 -0.9406685 3.2701986 0.1165674
ManufacturingProcess10 23 167 9.1790419 0.7666884 9.100 9.1318519 0.5930400 7.500 11.600 4.100 0.6492504 0.6317264 0.0593281
ManufacturingProcess11 24 166 9.3855422 0.7157336 9.400 9.3932836 0.6671700 7.500 11.500 4.000 -0.0193109 0.3227966 0.0555517
ManufacturingProcess12 25 175 857.8114286 1784.5282624 0.000 516.1985816 0.0000000 0.000 4549.000 4549.000 1.5786729 0.4951353 134.8976569
ManufacturingProcess13 26 176 34.5079545 1.0152800 34.600 34.5119718 0.8895600 32.100 38.600 6.500 0.4802776 1.9593883 0.0765296
ManufacturingProcess14 27 175 4853.8685714 54.5236412 4856.000 4854.5744681 40.0302000 4701.000 5055.000 354.000 -0.0109687 1.0781378 4.1215999
ManufacturingProcess15 28 176 6038.9204545 58.3125023 6031.500 6035.5211268 40.7715000 5904.000 6233.000 329.000 0.6743478 1.2162163 4.3954702
ManufacturingProcess16 29 176 4565.8011364 351.6973215 4588.000 4588.3591549 42.9954000 0.000 4852.000 4852.000 -12.4202248 158.3981993 26.5101831
ManufacturingProcess17 30 176 34.3437500 1.2482059 34.400 34.3126761 1.1860800 31.300 40.000 8.700 1.1629715 4.6626982 0.0940871
ManufacturingProcess18 31 176 4809.6818182 367.4777364 4835.000 4837.0704225 34.8411000 0.000 4971.000 4971.000 -12.7361378 163.7375845 27.6996766
ManufacturingProcess19 32 176 6028.1988636 45.5785689 6022.000 6026.1549296 36.3237000 5890.000 6146.000 256.000 0.2973414 0.2962151 3.4356139
ManufacturingProcess20 33 176 4556.4602273 349.0089784 4582.000 4580.9788732 42.9954000 0.000 4759.000 4759.000 -12.6383268 162.0663905 26.3075416
ManufacturingProcess21 34 176 -0.1642045 0.7782930 -0.300 -0.2556338 0.4447800 -1.800 3.600 5.400 1.7291140 5.0274763 0.0586660
ManufacturingProcess22 35 175 5.4057143 3.3306262 5.000 5.2482270 4.4478000 0.000 12.000 12.000 0.3148909 -1.0175458 0.2517717
ManufacturingProcess23 36 175 3.0171429 1.6625499 3.000 2.9432624 1.4826000 0.000 6.000 6.000 0.1967985 -0.9975572 0.1256770
ManufacturingProcess24 37 175 8.8342857 5.7994224 8.000 8.5744681 7.4130000 0.000 23.000 23.000 0.3593200 -1.0207362 0.4383951
ManufacturingProcess25 38 171 4828.1754386 373.4810865 4855.000 4855.5620438 34.0998000 0.000 4990.000 4990.000 -12.6310220 160.3293620 28.5608125
ManufacturingProcess26 39 171 6015.5964912 464.8674900 6047.000 6048.5547445 38.5476000 0.000 6161.000 6161.000 -12.6694398 160.9849144 35.5493055
ManufacturingProcess27 40 171 4562.5087719 353.9848679 4587.000 4587.4452555 35.5824000 0.000 4710.000 4710.000 -12.5174778 158.3931091 27.0698994
ManufacturingProcess28 41 171 6.5918129 5.2489823 10.400 6.8248175 1.0378200 0.000 11.500 11.500 -0.4556335 -1.7907822 0.4013997
ManufacturingProcess29 42 171 20.0111111 1.6638879 19.900 20.0437956 0.4447800 0.000 22.000 22.000 -10.0848133 119.4378857 0.1272407
ManufacturingProcess30 43 171 9.1614035 0.9760824 9.100 9.2145985 0.7413000 0.000 11.200 11.200 -4.7557268 43.0848842 0.0746429
ManufacturingProcess31 44 171 70.1847953 5.5557816 70.800 70.7240876 0.8895600 0.000 72.500 72.500 -11.8231008 146.0094297 0.4248612
ManufacturingProcess32 45 176 158.4659091 5.3972456 158.000 158.3380282 4.4478000 143.000 173.000 30.000 0.2112252 0.0602714 0.4068327
ManufacturingProcess33 46 171 63.5438596 2.4833813 64.000 63.5474453 1.4826000 56.000 70.000 14.000 -0.1310030 0.2740324 0.1899089
ManufacturingProcess34 47 171 2.4935673 0.0543910 2.500 2.4927007 0.0000000 2.300 2.600 0.300 -0.2634497 1.0013075 0.0041594
ManufacturingProcess35 48 171 495.5964912 10.8196874 495.000 495.7445255 8.8956000 463.000 522.000 59.000 -0.1556154 0.4130958 0.8274022
ManufacturingProcess36 49 171 0.0195731 0.0008739 0.020 0.0195620 0.0014826 0.017 0.022 0.005 0.1453141 -0.0557822 0.0000668
ManufacturingProcess37 50 176 1.0136364 0.4450828 1.000 0.9964789 0.4447800 0.000 2.300 2.300 0.3783578 0.0698597 0.0335494
ManufacturingProcess38 51 176 2.5340909 0.6493753 3.000 2.6126761 0.0000000 0.000 3.000 3.000 -1.6818052 3.9189211 0.0489485
ManufacturingProcess39 52 176 6.8511364 1.5054943 7.200 7.1718310 0.1482600 0.000 7.500 7.500 -4.2691214 16.4987895 0.1134809
ManufacturingProcess40 53 175 0.0177143 0.0382885 0.000 0.0099291 0.0000000 0.000 0.100 0.100 1.6768073 0.8164458 0.0028943
ManufacturingProcess41 54 175 0.0237143 0.0538242 0.000 0.0106383 0.0000000 0.000 0.200 0.200 2.1686898 3.6290714 0.0040687
ManufacturingProcess42 55 176 11.2062500 1.9416092 11.600 11.5429577 0.2965200 0.000 12.100 12.100 -5.4500082 28.5288867 0.1463543
ManufacturingProcess43 56 176 0.9119318 0.8679860 0.800 0.8077465 0.2965200 0.000 11.000 11.000 9.0548747 101.0332345 0.0654269
ManufacturingProcess44 57 176 1.8051136 0.3220062 1.900 1.8549296 0.1482600 0.000 2.100 2.100 -4.9703552 25.0876065 0.0242721
ManufacturingProcess45 58 176 2.1380682 0.4069043 2.200 2.2042254 0.1482600 0.000 2.600 2.600 -4.0779411 18.7565001 0.0306716
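Missing Values

The n column of the summary above shows which variables have missing values (n < 176). ManufacturingProcess03 has the most, with 15 of 176 values missing.

The most highly correlated predictor pairs are listed below (each pair appears twice, once in each order).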
##                          X1                     X2     value
## 1997 ManufacturingProcess26 ManufacturingProcess25 0.9975339
## 2052 ManufacturingProcess25 ManufacturingProcess26 0.9975339
## 2054 ManufacturingProcess27 ManufacturingProcess26 0.9960721
## 2109 ManufacturingProcess26 ManufacturingProcess27 0.9960721
## 1998 ManufacturingProcess27 ManufacturingProcess25 0.9934932
## 2108 ManufacturingProcess25 ManufacturingProcess27 0.9934932
## 1599 ManufacturingProcess20 ManufacturingProcess18 0.9917474
## 1709 ManufacturingProcess18 ManufacturingProcess20 0.9917474
## 2002 ManufacturingProcess31 ManufacturingProcess25 0.9706780
## 2332 ManufacturingProcess25 ManufacturingProcess31 0.9706780
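A listing like the one above can be produced by flattening a correlation matrix into one row per pair and sorting by absolute correlation. The sketch below uses a small made-up data frame (`toy`) rather than the chemical manufacturing data, and it is one possible approach, not necessarily the code used for this report.

```r
set.seed(1)
a <- rnorm(50)
# Toy data: B is nearly a copy of A, C is unrelated
toy <- data.frame(A = a, B = a + rnorm(50, sd = 0.05), C = rnorm(50))

# Pairwise correlations, tolerating missing values
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Flatten the matrix to one row per (X1, X2) pair
pairs_df <- as.data.frame(as.table(cor_mat))
names(pairs_df) <- c("X1", "X2", "value")

# Drop the diagonal and sort by strength of correlation
pairs_df <- pairs_df[as.character(pairs_df$X1) != as.character(pairs_df$X2), ]
pairs_df <- pairs_df[order(-abs(pairs_df$value)), ]
head(pairs_df, 2)
```

Because the matrix is symmetric, every pair shows up twice, which matches the duplicated rows in the output above.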

Let’s train some non-linear models next.

KNN Model for Chemical Manufacturing Process

##   k
## 1 5
##    rsquared     rmse
## 1 0.4743023 1.355025

The top 5 important variables are ManufacturingProcess13, ManufacturingProcess32, BiologicalMaterial06, ManufacturingProcess17 and BiologicalMaterial03.

The best tune for the KNN model, the one with the smallest resampled root mean squared error, uses 5 nearest neighbors. It has RMSE = 1.36 and R2 = 0.47; of the values of k tried, this one gives the smallest error. The residuals are fairly small, and several predictors have scaled importance scores above 35.

SVM Model for Chemical Manufacturing Process

##        sigma  C
## 7 0.01328094 16
##    rsquared     rmse
## 1 0.6573563 1.083869

RMSE is used to select the optimal model using the smallest value. The cost value for the SVM model that resulted in the smallest resampled RMSE is C = 16. The tuning parameter ‘sigma’ was held constant at a value of 0.013. The model has RMSE = 1.08 and R2 = 0.66; of the cost values tried, this one gives the smallest error.

The top 5 important variables are ManufacturingProcess13, ManufacturingProcess32, BiologicalMaterial06, ManufacturingProcess17 and BiologicalMaterial03.

MARS Model for Chemical Manufacturing Process

##   nprune degree
## 4      5      1
##    rsquared     rmse
## 1 0.5170738 1.317713

RMSE is used to select the optimal model using the smallest value. The tuning parameters for the MARS model that resulted in the smallest resampled RMSE are degree = 1 (an additive model with no interaction terms) and nprune = 5 retained terms. The model has RMSE = 1.32 and R2 = 0.52; of the parameter combinations tried, this one gives the smallest error. Only 3 predictors are identified as informative in this case: ManufacturingProcess32, ManufacturingProcess09 and ManufacturingProcess13.

NNET Model for Chemical Manufacturing Process

##   size decay   bag
## 5    5  0.01 FALSE
##    rsquared    rmse
## 1 0.3985338 1.68192

RMSE is used to select the optimal model using the smallest value. The tuning parameters for the NNET model that result in the smallest resampled RMSE are 5 units in the hidden layer (size = 5) and a weight decay of 0.01, the regularization parameter that guards against over-fitting. The model has RMSE = 1.68 and R2 = 0.40. Of the parameter combinations tried, this one gives the smallest error, even though it explains only about 40% of the variability in yield.

The top 5 important variables are ManufacturingProcess13, ManufacturingProcess32, BiologicalMaterial06, ManufacturingProcess17 and BiologicalMaterial03.

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

Resampling

The resampling performance metrics are calculated below and used to select the model that best fits the data. The results suggest that the SVM model had the largest mean R2 (0.66) across the 10 resamples. The SVM model also produced the smallest mean error, RMSE = 1.08. The SVM model therefore has better resampling performance than the KNN, MARS and NNET models.

## 
## Call:
## summary.resamples(object = resamples(list(knn = knnModel, svm = svmModel,
##  mars = marsModel, nnet = nnetModel)))
## 
## Models: knn, svm, mars, nnet 
## Number of resamples: 10 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## knn  0.7970000 1.0039667 1.1231429 1.0941885 1.1817286 1.279714    0
## svm  0.5528591 0.7857382 0.9124654 0.8615100 0.9403178 1.068548    0
## mars 0.7289428 0.8619740 0.9237133 0.9751312 1.1126344 1.230238    0
## nnet 0.8064562 1.2567945 1.4186034 1.4077181 1.5926685 1.933445    0
## 
## RMSE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## knn  1.0521074 1.266880 1.316138 1.355025 1.476499 1.661273    0
## svm  0.8083948 1.004793 1.072562 1.083869 1.182588 1.435115    0
## mars 0.9865612 1.139639 1.163285 1.215740 1.364162 1.503633    0
## nnet 0.9927290 1.520522 1.705051 1.681920 1.895587 2.148641    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## knn  0.1556778 0.2418591 0.5541821 0.4743023 0.6647513 0.7209630    0
## svm  0.4302169 0.5847018 0.6834006 0.6573563 0.7289782 0.8408634    0
## mars 0.3651698 0.4501465 0.6024610 0.5754682 0.6868517 0.7685032    0
## nnet 0.1631993 0.3515936 0.4320122 0.3985338 0.4704340 0.6250521    0
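The Mean column in each block above is an average of the metric over the 10 resamples. The sketch below shows that idea on simulated data, assuming 10-fold cross-validation (the report only states that there were 10 resamples) and using a plain linear model as a stand-in for the tuned models.

```r
set.seed(100)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.5)   # made-up data for illustration

# Randomly assign each row to one of 10 folds
folds <- sample(rep(1:10, length.out = n))

# Fit on nine folds, measure RMSE on the held-out fold
fold_rmse <- sapply(1:10, function(i) {
  fit <- lm(y ~ x, data = dat[folds != i, ])
  pred <- predict(fit, newdata = dat[folds == i, ])
  sqrt(mean((pred - dat$y[folds == i])^2))
})

summary(fold_rmse)   # Min., quartiles, Mean and Max., as in the tables above
mean(fold_rmse)      # the resampled RMSE used to compare models
```

The spread between Min. and Max. indicates how much the estimate varies from one held-out fold to the next, which is worth checking when two models have similar means.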

Now let’s calculate the prediction accuracy on the test data.

## $knn
##      RMSE  Rsquared       MAE 
## 1.1490426 0.5268106 0.9010625 
## 
## $svm
##      RMSE  Rsquared       MAE 
## 0.9209008 0.6970918 0.7439161 
## 
## $mars
##      RMSE  Rsquared       MAE 
## 1.0983490 0.5614890 0.8468836 
## 
## $nnet
##      RMSE  Rsquared       MAE 
## 1.9327283 0.3821101 1.3210742

From the results above, we can see that the SVM model predicts the test set response most accurately, with R2 = 0.70 and RMSE = 0.92. The SVM model therefore gives the best resampling performance and the best test set performance.

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The list of important predictors is shown below.

## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess13  100.00
## ManufacturingProcess32   99.64
## BiologicalMaterial06     83.61
## ManufacturingProcess17   78.70
## BiologicalMaterial03     77.88
## ManufacturingProcess36   74.36
## ManufacturingProcess09   69.04
## ManufacturingProcess06   59.92
## ManufacturingProcess33   51.79
## ManufacturingProcess11   46.74
## BiologicalMaterial08     42.75
## BiologicalMaterial11     42.55
## BiologicalMaterial01     42.47
## BiologicalMaterial09     40.49
## ManufacturingProcess02   37.58
## ManufacturingProcess12   36.38
## ManufacturingProcess30   32.79
## ManufacturingProcess20   27.36
## ManufacturingProcess15   23.26
## BiologicalMaterial10     23.22

As can be seen, the process variables dominate the list, with 8 variables in the top 10 and 5 variables in the next 10. In the previous assignment, we had found that the elastic net model provided the best fit.

Now, let’s re-check the key predictors for this model.

## glmnet variable importance
## 
##   only 20 most important variables shown (out of 46)
## 
##                        Overall
## ManufacturingProcess32 0.84511
## ManufacturingProcess09 0.37092
## ManufacturingProcess13 0.34844
## ManufacturingProcess06 0.19245
## ManufacturingProcess15 0.13481
## ManufacturingProcess07 0.12152
## ManufacturingProcess17 0.12095
## ManufacturingProcess39 0.10437
## ManufacturingProcess37 0.10047
## ManufacturingProcess04 0.08981
## ManufacturingProcess34 0.08618
## BiologicalMaterial03   0.08296
## ManufacturingProcess36 0.08243
## BiologicalMaterial05   0.07407
## ManufacturingProcess44 0.05441
## ManufacturingProcess28 0.01030
## ManufacturingProcess19 0.00000
## ManufacturingProcess23 0.00000
## ManufacturingProcess16 0.00000
## BiologicalMaterial10   0.00000

Comparing the top 10 predictors according to the best linear and non-linear models, we see that 5 of them are common across both lists. These are: ManufacturingProcess13, ManufacturingProcess32, ManufacturingProcess17, ManufacturingProcess09 and ManufacturingProcess06. For the remaining 5, the SVM model selects 2 biological variables and 3 process variables, while the elastic net model selects only other process variables.
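The overlap between the two lists can be checked with set operations. The vectors below are copied from the first ten rows of the two importance tables above; the variable names `svm_top10` and `enet_top10` are chosen here for illustration.

```r
# Top 10 predictors from the loess r-squared importance table (SVM)
svm_top10 <- c("ManufacturingProcess13", "ManufacturingProcess32",
               "BiologicalMaterial06", "ManufacturingProcess17",
               "BiologicalMaterial03", "ManufacturingProcess36",
               "ManufacturingProcess09", "ManufacturingProcess06",
               "ManufacturingProcess33", "ManufacturingProcess11")

# Top 10 predictors from the glmnet importance table (elastic net)
enet_top10 <- c("ManufacturingProcess32", "ManufacturingProcess09",
                "ManufacturingProcess13", "ManufacturingProcess06",
                "ManufacturingProcess15", "ManufacturingProcess07",
                "ManufacturingProcess17", "ManufacturingProcess39",
                "ManufacturingProcess37", "ManufacturingProcess04")

common <- intersect(svm_top10, enet_top10)    # predictors in both lists
svm_only <- setdiff(svm_top10, enet_top10)    # unique to the SVM list
enet_only <- setdiff(enet_top10, svm_top10)   # unique to the elastic net list

common
svm_only
enet_only
```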

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Let’s look at the correlations between the top predictors and the response variable i.e. yield for the data.

##                              [,1]
## ManufacturingProcess13 -0.5264633
## ManufacturingProcess32  0.6180817
## BiologicalMaterial06    0.5030526
## ManufacturingProcess17 -0.4550352
## BiologicalMaterial03    0.4755898
## ManufacturingProcess36 -0.5339554
## ManufacturingProcess09  0.5020150
## ManufacturingProcess06  0.4467170
## ManufacturingProcess33  0.4456051
## ManufacturingProcess11  0.3520161

From the above we can see that 7 of the top 10 predictors (for example, ManufacturingProcess32) are positively correlated with yield, while 3 of the top 10 (for example, ManufacturingProcess13) are negatively correlated with yield. Among the positive correlations, ManufacturingProcess32 has the strongest association with yield (r = 0.62): yield tends to be higher when ManufacturingProcess32 is higher. Among the negative correlations, ManufacturingProcess36 (r = -0.53) and ManufacturingProcess13 (r = -0.53) are the strongest: yield tends to be lower when these are higher. A correlation coefficient measures the strength of a linear association; it is not the mean change in yield for every additional unit of the predictor.
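A correlation coefficient and a per-unit change are related but not the same quantity: for a simple linear regression, the slope equals the correlation multiplied by sd(y) / sd(x). The simulated example below illustrates this with made-up data; it does not use the chemical manufacturing variables.

```r
set.seed(42)
x <- rnorm(200, sd = 3)
y <- 0.5 * x + rnorm(200)

r <- cor(x, y)                          # unitless strength of linear association
slope <- unname(coef(lm(y ~ x))[2])     # mean change in y per one-unit increase in x

r
slope
all.equal(slope, r * sd(y) / sd(x))     # the two are linked through the standard deviations
```

To state how much yield changes per unit of a predictor, the slope (or a partial dependence plot from the fitted model) is the quantity to report, not the correlation.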