Textbook: Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, New York, 2013.
# Required R packages
library(tidyverse)
library(AppliedPredictiveModeling)
library(mlbench)
library(caret)
library(kableExtra)
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\]
where the x values are random variables uniformly distributed on [0, 1] (five additional non-informative predictors are also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
set.seed(200)
trainingData = mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x = data.frame(trainingData$x)
## Look at the data using featurePlot or other methods.
featurePlot(trainingData$x, trainingData$y)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData = mlbench.friedman1(5000, sd = 1)
testData$x = data.frame(testData$x)
Tune several models on these data.
k-Nearest Neighbors finds, for each new sample, the k training set vectors that are closest in Euclidean distance. For classification the prediction is decided by majority vote among the neighbors (ties broken at random); for a regression problem such as this one, the prediction is the average of the k neighbors' outcome values, as sketched below.
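As an illustration only (not part of the original analysis), the kNN regression prediction for a single test point can be built by hand. The value k = 11 is assumed here simply to match the tune selected later; the caret model below handles centering, scaling, and tuning automatically.
## Minimal sketch of a kNN regression prediction for one test point:
## center and scale with the training statistics, find the k nearest
## training samples by Euclidean distance, then average their responses.
k = 11
x.train = scale(trainingData$x)
x.new = scale(testData$x,
center = attr(x.train, "scaled:center"),
scale = attr(x.train, "scaled:scale"))[1, ]
dists = sqrt(rowSums(sweep(x.train, 2, x.new)^2)) # distance to every training row
neighbors = order(dists)[1:k]                     # indices of the k closest rows
mean(trainingData$y[neighbors])                   # prediction = mean neighbor response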
set.seed(525)
knnModel = train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.180070 0.5892193 2.642994
## 7 3.064254 0.6415196 2.520158
## 9 3.032960 0.6767063 2.443912
## 11 2.993264 0.7024572 2.407608
## 13 3.026622 0.7064391 2.429608
## 15 3.046360 0.7104656 2.449903
## 17 3.057284 0.7219624 2.461754
## 19 3.067420 0.7304459 2.464944
## 21 3.089007 0.7362900 2.494963
## 23 3.120123 0.7370889 2.534550
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
## k
## 4 11
data.frame(rsquared = knnModel[["results"]][["Rsquared"]][as.numeric(rownames(knnModel$bestTune))],
rmse = knnModel[["results"]][["RMSE"]][as.numeric(rownames(knnModel$bestTune))])
## rsquared rmse
## 1 0.7024572 2.993264
The best tune for the KNN model, selected for the smallest cross-validated root mean squared error, is k = 11, with RMSE = 2.99 and \(R^2\) = 0.70. Although this tune explains only a moderate portion of the variability in the data and its residuals remain fairly large, it produces the smallest error among the values of k considered. The top informative predictors are X4, X1, X2, X5, and X3.
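The predictor ranking mentioned above can be checked with caret's model-free importance score (the original chunk that produced the ranking is not shown, so this is one plausible way to obtain it):
## Rank predictors for the kNN fit; since kNN has no built-in importance,
## varImp() falls back to a loess R^2 filter between each predictor and y.
varImp(knnModel)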
Support vector machines carry out general regression and classification (of the nu and epsilon types), as well as density estimation. The goal of an SVM is to take groups of observations and construct boundaries (or, in regression, a function) for predicting future observations from their measurements.
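The radial SVM tuned below has two tuning parameters: the cost C and the kernel width \(\sigma\) of the Gaussian radial basis kernel, which in kernlab's parameterization is
\[K(x, x') = \exp\left(-\sigma \lVert x - x' \rVert^2\right)\]
caret estimates a reasonable value of \(\sigma\) analytically, which is why it is held constant in the output, and tunes only C over the grid.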
set.seed(525)
svmModel = train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.522735 0.7878795 1.998535
## 0.50 2.260320 0.8038050 1.782628
## 1.00 2.058638 0.8297378 1.625673
## 2.00 1.940958 0.8447382 1.522595
## 4.00 1.869965 0.8533352 1.460183
## 8.00 1.853925 0.8557931 1.450413
## 16.00 1.857397 0.8554917 1.453411
## 32.00 1.857397 0.8554917 1.453411
## 64.00 1.857397 0.8554917 1.453411
## 128.00 1.857397 0.8554917 1.453411
## 256.00 1.857397 0.8554917 1.453411
## 512.00 1.857397 0.8554917 1.453411
## 1024.00 1.857397 0.8554917 1.453411
## 2048.00 1.857397 0.8554917 1.453411
##
## Tuning parameter 'sigma' was held constant at a value of 0.0699987
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0699987 and C = 8.
## sigma C
## 6 0.0699987 8
data.frame(rsquared = svmModel[["results"]][["Rsquared"]][as.numeric(rownames(svmModel$bestTune))],
rmse = svmModel[["results"]][["RMSE"]][as.numeric(rownames(svmModel$bestTune))])
## rsquared rmse
## 1 0.8557931 1.853925
RMSE was used to select the optimal model using the smallest value: the best tune for the SVM model is C = 8, with the tuning parameter sigma held constant at 0.070. It has RMSE = 1.85 and \(R^2\) = 0.86, so it explains a larger portion of the variability, and produces a smaller error, than the KNN model. The top informative predictors are again X4, X1, X2, X5, and X3.
MARS is an algorithm that builds a piecewise linear model, providing an intuitive stepping stone into non-linearity after grasping linear regression and other intrinsically linear models. It has two tuning parameters, the degree of interaction and the number of retained terms, and a grid search over both is used to identify the combination that minimizes prediction error.
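Each MARS term is built from hinge (hockey-stick) basis functions of the form
\[h(x - c) = \max(0, x - c), \qquad h(c - x) = \max(0, c - x),\]
where c is a knot chosen by the algorithm; products of such functions give the interaction terms controlled by the degree parameter.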
set.seed(525)
marsGrid = expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel = train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv",
number = 10))
## nprune degree
## 48 12 2
data.frame(rsquared = marsModel[["results"]][["Rsquared"]][as.numeric(rownames(marsModel$bestTune))],
rmse = marsModel[["results"]][["RMSE"]][as.numeric(rownames(marsModel$bestTune))])
## rsquared rmse
## 1 0.9339054 1.269078
RMSE was used to select the optimal model using the smallest value: the best tune for the MARS model uses second-degree interactions with 12 retained terms, giving RMSE = 1.27 and \(R^2\) = 0.933. It accounts for a larger portion of the variability, and produces a smaller error, than the previous models. Moreover, the top informative predictors are only X1, X4, X2, and X5, fewer than for the two previous models, and the residuals are smaller than for the KNN model.
Neural Networks are a machine learning framework that attempts to mimic the learning pattern of natural biological neural networks. As with MARS, tuning is done by creating a specific candidate set of models to evaluate.
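The averaged neural network (avNNet) fits several single-hidden-layer networks from different random starting weights and averages their predictions; with a linear output unit, each network has the form (standard nnet notation, stated here for context)
\[f(x) = \beta_0 + \sum_{k=1}^{H} \beta_k\, g\!\left(\gamma_{0k} + \sum_{j=1}^{P} \gamma_{jk} x_j\right), \qquad g(u) = \frac{1}{1 + e^{-u}},\]
where H is the number of hidden units (size) and the weight decay penalty shrinks the \(\gamma\) and \(\beta\) coefficients to guard against over-fitting.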
set.seed(525)
nnetGrid = expand.grid(decay = c(0, 0.01, .1),
size = c(1:10), bag = FALSE)
nnetModel = train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv",
number = 10),
preProc = c("center", "scale"),
linout = TRUE, trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
## size decay bag
## 4 4 0 FALSE
data.frame(rsquared = nnetModel[["results"]][["Rsquared"]][as.numeric(rownames(nnetModel$bestTune))],
rmse = nnetModel[["results"]][["RMSE"]][as.numeric(rownames(nnetModel$bestTune))])
## rsquared rmse
## 1 0.7506085 2.462835
RMSE was used to select the optimal model using the smallest value: the best tune for the averaged neural network has 4 units in the hidden layer and a weight decay (regularization) value of 0, giving RMSE = 2.46 and \(R^2\) = 0.750. It explains more of the variability than the KNN model but less than the SVM and MARS models. The top informative predictors are again X4, X1, X2, X5, and X3, and the residuals are somewhat larger than for the MARS model.
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
set.seed(525)
knnPred = predict(knnModel, newdata = testData$x)
svmRadialPred = predict(svmModel, newdata = testData$x)
marsPred = predict(marsModel, newdata = testData$x)
nnetPred = predict(nnetModel, newdata = testData$x)
data.frame(rbind(KNN = postResample(pred = knnPred, obs = testData$y),
SVM = postResample(pred = svmRadialPred, obs = testData$y),
MARS = postResample(pred = marsPred, obs = testData$y),
NNET = postResample(pred = nnetPred, obs = testData$y)))
## RMSE Rsquared MAE
## KNN 3.122264 0.6690472 2.496365
## SVM 2.084905 0.8240800 1.584360
## MARS 1.280306 0.9335241 1.016867
## NNET 2.321289 0.7969644 1.687902
From the results above, the MARS model explains the largest portion of the test-set variability using the informative predictors X1-X5, and it produces the smallest root mean squared error among the models on the test data (RMSE = 1.28). It can therefore be stated that the Multivariate Adaptive Regression Splines model fits these data better than the K-Nearest Neighbors, Support Vector Machines, and Neural Network models. Whether MARS actually selects the informative predictors can be confirmed with the importance listing sketched below.
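A quick check (not part of the original output) is to ask the fitted MARS model which predictors it retained; for earth-based fits, varImp reports the model's own importance scores, so unused predictors score zero.
## Importance of predictors retained by the final MARS model; the
## informative X1-X5 should appear at the top and X6-X10 near zero.
varImp(marsModel)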
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
From Homework #6, the matrix ChemicalManufacturingProcess containing the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs was used.
The variable summary below highlights that some variables have n less than 176; these are the variables with missing values that must be imputed. Moreover, quite a few variables are heavily skewed, which suggests that a transformation toward normality is needed.
data(ChemicalManufacturingProcess)
psych::describe(ChemicalManufacturingProcess)[,-c(1,5,6,10,13)] %>%
kable() %>%
kable_styling(bootstrap_options = "striped") %>%
scroll_box(width = "100%", height = "300px")

| | n | mean | sd | mad | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Yield | 176 | 40.1765341 | 1.8456664 | 1.9718580 | 35.250 | 46.340 | 0.3109596 | -0.1132944 |
| BiologicalMaterial01 | 176 | 6.4114205 | 0.7139225 | 0.6745830 | 4.580 | 8.810 | 0.2733165 | 0.4567758 |
| BiologicalMaterial02 | 176 | 55.6887500 | 4.0345806 | 4.5812340 | 46.870 | 64.750 | 0.2441269 | -0.7050911 |
| BiologicalMaterial03 | 176 | 67.7050000 | 4.0010641 | 4.2773010 | 56.970 | 78.250 | 0.0285108 | -0.1235203 |
| BiologicalMaterial04 | 176 | 12.3492614 | 1.7746607 | 1.3714050 | 9.380 | 23.090 | 1.7323153 | 7.0564614 |
| BiologicalMaterial05 | 176 | 18.5986364 | 1.8441408 | 1.8829020 | 13.240 | 24.850 | 0.3040053 | 0.2198005 |
| BiologicalMaterial06 | 176 | 48.9103977 | 3.7460718 | 3.9437160 | 40.600 | 59.380 | 0.3685344 | -0.3654933 |
| BiologicalMaterial07 | 176 | 100.0141477 | 0.1077423 | 0.0000000 | 100.000 | 100.830 | 7.3986642 | 53.0417012 |
| BiologicalMaterial08 | 176 | 17.4947727 | 0.6769536 | 0.5930400 | 15.880 | 19.140 | 0.2200539 | 0.0627721 |
| BiologicalMaterial09 | 176 | 12.8500568 | 0.4151757 | 0.4225410 | 11.440 | 14.080 | -0.2684177 | 0.2927765 |
| BiologicalMaterial10 | 176 | 2.8006250 | 0.5991433 | 0.4003020 | 1.770 | 6.870 | 2.4023783 | 11.6471845 |
| BiologicalMaterial11 | 176 | 146.9531818 | 4.8204704 | 4.1142150 | 135.810 | 158.730 | 0.3588211 | 0.0162456 |
| BiologicalMaterial12 | 176 | 20.1998864 | 0.7735440 | 0.6671700 | 18.350 | 22.210 | 0.3038443 | 0.0146595 |
| ManufacturingProcess01 | 175 | 11.2074286 | 1.8224342 | 1.0378200 | 0.000 | 14.100 | -3.9201855 | 21.8688069 |
| ManufacturingProcess02 | 173 | 16.6826590 | 8.4715694 | 1.4826000 | 0.000 | 22.500 | -1.4307675 | 0.1062466 |
| ManufacturingProcess03 | 161 | 1.5395652 | 0.0223983 | 0.0148260 | 1.470 | 1.600 | -0.4799447 | 1.7280557 |
| ManufacturingProcess04 | 175 | 931.8514286 | 6.2744406 | 5.9304000 | 911.000 | 946.000 | -0.6979357 | 0.0631282 |
| ManufacturingProcess05 | 175 | 1001.6931429 | 30.5272134 | 17.3464200 | 923.000 | 1175.300 | 2.5872769 | 11.7446904 |
| ManufacturingProcess06 | 174 | 207.4017241 | 2.6993999 | 1.9273800 | 203.000 | 227.400 | 3.0419007 | 17.3764864 |
| ManufacturingProcess07 | 175 | 177.4800000 | 0.5010334 | 0.0000000 | 177.000 | 178.000 | 0.0793788 | -2.0050587 |
| ManufacturingProcess08 | 175 | 177.5542857 | 0.4984706 | 0.0000000 | 177.000 | 178.000 | -0.2165645 | -1.9642262 |
| ManufacturingProcess09 | 176 | 45.6601136 | 1.5464407 | 1.2157320 | 38.890 | 49.360 | -0.9406685 | 3.2701986 |
| ManufacturingProcess10 | 167 | 9.1790419 | 0.7666884 | 0.5930400 | 7.500 | 11.600 | 0.6492504 | 0.6317264 |
| ManufacturingProcess11 | 166 | 9.3855422 | 0.7157336 | 0.6671700 | 7.500 | 11.500 | -0.0193109 | 0.3227966 |
| ManufacturingProcess12 | 175 | 857.8114286 | 1784.5282624 | 0.0000000 | 0.000 | 4549.000 | 1.5786729 | 0.4951353 |
| ManufacturingProcess13 | 176 | 34.5079545 | 1.0152800 | 0.8895600 | 32.100 | 38.600 | 0.4802776 | 1.9593883 |
| ManufacturingProcess14 | 175 | 4853.8685714 | 54.5236412 | 40.0302000 | 4701.000 | 5055.000 | -0.0109687 | 1.0781378 |
| ManufacturingProcess15 | 176 | 6038.9204545 | 58.3125023 | 40.7715000 | 5904.000 | 6233.000 | 0.6743478 | 1.2162163 |
| ManufacturingProcess16 | 176 | 4565.8011364 | 351.6973215 | 42.9954000 | 0.000 | 4852.000 | -12.4202248 | 158.3981993 |
| ManufacturingProcess17 | 176 | 34.3437500 | 1.2482059 | 1.1860800 | 31.300 | 40.000 | 1.1629715 | 4.6626982 |
| ManufacturingProcess18 | 176 | 4809.6818182 | 367.4777364 | 34.8411000 | 0.000 | 4971.000 | -12.7361378 | 163.7375845 |
| ManufacturingProcess19 | 176 | 6028.1988636 | 45.5785689 | 36.3237000 | 5890.000 | 6146.000 | 0.2973414 | 0.2962151 |
| ManufacturingProcess20 | 176 | 4556.4602273 | 349.0089784 | 42.9954000 | 0.000 | 4759.000 | -12.6383268 | 162.0663905 |
| ManufacturingProcess21 | 176 | -0.1642045 | 0.7782930 | 0.4447800 | -1.800 | 3.600 | 1.7291140 | 5.0274763 |
| ManufacturingProcess22 | 175 | 5.4057143 | 3.3306262 | 4.4478000 | 0.000 | 12.000 | 0.3148909 | -1.0175458 |
| ManufacturingProcess23 | 175 | 3.0171429 | 1.6625499 | 1.4826000 | 0.000 | 6.000 | 0.1967985 | -0.9975572 |
| ManufacturingProcess24 | 175 | 8.8342857 | 5.7994224 | 7.4130000 | 0.000 | 23.000 | 0.3593200 | -1.0207362 |
| ManufacturingProcess25 | 171 | 4828.1754386 | 373.4810865 | 34.0998000 | 0.000 | 4990.000 | -12.6310220 | 160.3293620 |
| ManufacturingProcess26 | 171 | 6015.5964912 | 464.8674900 | 38.5476000 | 0.000 | 6161.000 | -12.6694398 | 160.9849144 |
| ManufacturingProcess27 | 171 | 4562.5087719 | 353.9848679 | 35.5824000 | 0.000 | 4710.000 | -12.5174778 | 158.3931091 |
| ManufacturingProcess28 | 171 | 6.5918129 | 5.2489823 | 1.0378200 | 0.000 | 11.500 | -0.4556335 | -1.7907822 |
| ManufacturingProcess29 | 171 | 20.0111111 | 1.6638879 | 0.4447800 | 0.000 | 22.000 | -10.0848133 | 119.4378857 |
| ManufacturingProcess30 | 171 | 9.1614035 | 0.9760824 | 0.7413000 | 0.000 | 11.200 | -4.7557268 | 43.0848842 |
| ManufacturingProcess31 | 171 | 70.1847953 | 5.5557816 | 0.8895600 | 0.000 | 72.500 | -11.8231008 | 146.0094297 |
| ManufacturingProcess32 | 176 | 158.4659091 | 5.3972456 | 4.4478000 | 143.000 | 173.000 | 0.2112252 | 0.0602714 |
| ManufacturingProcess33 | 171 | 63.5438596 | 2.4833813 | 1.4826000 | 56.000 | 70.000 | -0.1310030 | 0.2740324 |
| ManufacturingProcess34 | 171 | 2.4935673 | 0.0543910 | 0.0000000 | 2.300 | 2.600 | -0.2634497 | 1.0013075 |
| ManufacturingProcess35 | 171 | 495.5964912 | 10.8196874 | 8.8956000 | 463.000 | 522.000 | -0.1556154 | 0.4130958 |
| ManufacturingProcess36 | 171 | 0.0195731 | 0.0008739 | 0.0014826 | 0.017 | 0.022 | 0.1453141 | -0.0557822 |
| ManufacturingProcess37 | 176 | 1.0136364 | 0.4450828 | 0.4447800 | 0.000 | 2.300 | 0.3783578 | 0.0698597 |
| ManufacturingProcess38 | 176 | 2.5340909 | 0.6493753 | 0.0000000 | 0.000 | 3.000 | -1.6818052 | 3.9189211 |
| ManufacturingProcess39 | 176 | 6.8511364 | 1.5054943 | 0.1482600 | 0.000 | 7.500 | -4.2691214 | 16.4987895 |
| ManufacturingProcess40 | 175 | 0.0177143 | 0.0382885 | 0.0000000 | 0.000 | 0.100 | 1.6768073 | 0.8164458 |
| ManufacturingProcess41 | 175 | 0.0237143 | 0.0538242 | 0.0000000 | 0.000 | 0.200 | 2.1686898 | 3.6290714 |
| ManufacturingProcess42 | 176 | 11.2062500 | 1.9416092 | 0.2965200 | 0.000 | 12.100 | -5.4500082 | 28.5288867 |
| ManufacturingProcess43 | 176 | 0.9119318 | 0.8679860 | 0.2965200 | 0.000 | 11.000 | 9.0548747 | 101.0332345 |
| ManufacturingProcess44 | 176 | 1.8051136 | 0.3220062 | 0.1482600 | 0.000 | 2.100 | -4.9703552 | 25.0876065 |
| ManufacturingProcess45 | 176 | 2.1380682 | 0.4069043 | 0.1482600 | 0.000 | 2.600 | -4.0779411 | 18.7565001 |
A small percentage of cells in the predictor set contain missing values. Because these proportions are not extreme for any variable, imputation by k-Nearest Neighbors is conducted: caret's knnImpute centers and scales the predictors and fills each missing entry with the average of its k nearest samples, found by Euclidean distance on the scaled data. As a result, the data set is complete after imputation. The quick check below (added for reference) quantifies the missingness beforehand.
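A minimal sketch, assuming the ChemicalManufacturingProcess data frame is loaded, for counting the missing cells before imputing:
## Number of missing values per predictor (only columns with any NAs),
## plus the overall fraction of missing cells among the predictors.
miss.by.col = colSums(is.na(ChemicalManufacturingProcess))
sort(miss.by.col[miss.by.col > 0], decreasing = TRUE)
mean(is.na(ChemicalManufacturingProcess[, -1]))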
pre.process = preProcess(ChemicalManufacturingProcess[, -c(1)], method = "knnImpute")
chemical = predict(pre.process, ChemicalManufacturingProcess[, -c(1)])
To build a smaller model without predictors that are extremely highly correlated, it is best to reduce the number of predictors so that no absolute pairwise correlation exceeds 0.90. The list below shows the ten largest pairwise correlations (after removing the trivial self-correlations of 1); all of them are above 0.97. Afterward, the data are pre-processed toward normality using the Yeo-Johnson transformation (Yeo & Johnson, 2000). This technique finds, for each predictor, the value of lambda that brings the transformed distribution as close to normal as possible, and unlike the Box-Cox transformation it works without having to worry about the domain of x.
corr = cor(chemical)
corr[corr == 1] = NA
corr[abs(corr) < 0.85] = NA
corr = na.omit(reshape::melt(corr))
head(corr[order(-abs(corr$value)),], 10)
## X1 X2 value
## 2090 ManufacturingProcess26 ManufacturingProcess25 0.9975339
## 2146 ManufacturingProcess25 ManufacturingProcess26 0.9975339
## 2148 ManufacturingProcess27 ManufacturingProcess26 0.9960721
## 2204 ManufacturingProcess26 ManufacturingProcess27 0.9960721
## 2091 ManufacturingProcess27 ManufacturingProcess25 0.9934932
## 2203 ManufacturingProcess25 ManufacturingProcess27 0.9934932
## 1685 ManufacturingProcess20 ManufacturingProcess18 0.9917474
## 1797 ManufacturingProcess18 ManufacturingProcess20 0.9917474
## 2095 ManufacturingProcess31 ManufacturingProcess25 0.9706780
## 2431 ManufacturingProcess25 ManufacturingProcess31 0.9706780
tooHigh = findCorrelation(cor(chemical), 0.90)
chemical = chemical[, -tooHigh]
(pre.process = preProcess(chemical, method = c("YeoJohnson", "center", "scale")))
## Created from 176 samples and 47 variables
##
## Pre-processing:
## - centered (47)
## - ignored (0)
## - scaled (47)
## - Yeo-Johnson transformation (41)
##
## Lambda estimates for Yeo-Johnson transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.8404 0.7085 0.8795 0.9698 1.2484 2.7992
Next, the data are split into training and testing sets in a ratio of 4:1, and several nonlinear regression models are fitted.
set.seed(525)
intrain = createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
train.p = chemical[intrain, ]
train.r = ChemicalManufacturingProcess$Yield[intrain]
test.p = chemical[-intrain, ]
test.r = ChemicalManufacturingProcess$Yield[-intrain]
Which nonlinear regression model gives the optimal resampling and test set performance?
set.seed(525)
knnModel = train(x = train.p,
y = train.r,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv"))## k-Nearest Neighbors
##
## 144 samples
## 47 predictor
##
## Pre-processing: centered (47), scaled (47)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 128, 128, 130, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.276353 0.5232341 1.030015
## 7 1.325396 0.4853696 1.080400
## 9 1.372812 0.4527093 1.107609
## 11 1.361426 0.4803417 1.119428
## 13 1.367557 0.4818801 1.121799
## 15 1.391723 0.4602591 1.133290
## 17 1.391383 0.4664510 1.136789
## 19 1.391150 0.4727794 1.129443
## 21 1.406613 0.4692706 1.145596
## 23 1.428708 0.4567568 1.165556
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
## k
## 1 5
data.frame(rsquared = knnModel[["results"]][["Rsquared"]][as.numeric(rownames(knnModel$bestTune))],
rmse = knnModel[["results"]][["RMSE"]][as.numeric(rownames(knnModel$bestTune))])
## rsquared rmse
## 1 0.5232341 1.276353
The best tune for the KNN model, selected for the smallest cross-validated root mean squared error, is k = 5, with RMSE = 1.28 and \(R^2\) = 0.52. Among the values of k considered, this tune explains the largest portion of the variability and produces the smallest error. The residuals are moderate, and quite a few predictors have relative importance scores above 70%.
set.seed(525)
svmModel = train(x = train.p,
y = train.r,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 47 predictor
##
## Pre-processing: centered (47), scaled (47)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 130, 128, 128, 130, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.376323 0.5084102 1.1180025
## 0.50 1.270768 0.5418885 1.0365006
## 1.00 1.203217 0.5675698 0.9732397
## 2.00 1.140920 0.5941137 0.9159240
## 4.00 1.133136 0.5930829 0.9077717
## 8.00 1.093984 0.6193074 0.8800622
## 16.00 1.088150 0.6237850 0.8801737
## 32.00 1.088150 0.6237850 0.8801737
## 64.00 1.088150 0.6237850 0.8801737
## 128.00 1.088150 0.6237850 0.8801737
## 256.00 1.088150 0.6237850 0.8801737
## 512.00 1.088150 0.6237850 0.8801737
## 1024.00 1.088150 0.6237850 0.8801737
## 2048.00 1.088150 0.6237850 0.8801737
##
## Tuning parameter 'sigma' was held constant at a value of 0.01271237
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01271237 and C = 16.
## sigma C
## 7 0.01271237 16
data.frame(rsquared = svmModel[["results"]][["Rsquared"]][as.numeric(rownames(svmModel$bestTune))],
rmse = svmModel[["results"]][["RMSE"]][as.numeric(rownames(svmModel$bestTune))])
## rsquared rmse
## 1 0.623785 1.08815
RMSE was used to select the optimal model using the smallest value: the best tune for the SVM model is C = 16, with the tuning parameter sigma held constant at 0.013, giving RMSE = 1.09 and \(R^2\) = 0.62. It explains a larger portion of the variability, and produces a smaller error, than the KNN model. Given its only slightly improved fit over KNN, the top informative predictors and the residuals are quite similar.
set.seed(525)
marsGrid = expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel = train(x = train.p,
y = train.r,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv",
number = 10))
## nprune degree
## 2 3 1
data.frame(rsquared = marsModel[["results"]][["Rsquared"]][as.numeric(rownames(marsModel$bestTune))],
rmse = marsModel[["results"]][["RMSE"]][as.numeric(rownames(marsModel$bestTune))])
## rsquared rmse
## 1 0.427616 1.403656
RMSE was used to select the optimal model using the smallest value: the best tune for the MARS model uses first-degree (additive) terms with 3 retained terms, giving RMSE = 1.40 and \(R^2\) = 0.43. It explains less of the variability, and produces a larger error, than the KNN and SVM models. Only three predictors enter the model, fewer than in the two models before, and the residuals are a bit larger than for those models.
set.seed(525)
nnetGrid = expand.grid(decay = c(0, 0.01, .1),
size = c(1,5,10), bag = FALSE)
nnetModel = train(x = train.p,
y = train.r,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv",
number = 10),
preProc = c("center", "scale"),
linout = TRUE, trace = FALSE,
maxit = 10)
## size decay bag
## 5 5 0.01 FALSE
data.frame(rsquared = nnetModel[["results"]][["Rsquared"]][as.numeric(rownames(nnetModel$bestTune))],
rmse = nnetModel[["results"]][["RMSE"]][as.numeric(rownames(nnetModel$bestTune))])
## rsquared rmse
## 1 0.4227157 1.664697
RMSE was used to select the optimal model using the smallest value: the best tune for the averaged neural network has 5 units in the hidden layer and a weight decay value of 0.01, giving RMSE = 1.66 and \(R^2\) = 0.42. It explains the smallest portion of the variability of the four models (only 42%) and produces the largest cross-validated error. Only a few predictors appear informative, and the residuals are somewhat larger than for the previous models.
By conducting a resampling comparison, performance metrics were collected and analyzed to determine the model that best fits the training data. The results below suggest that the SVM model has the largest mean \(R^2\) (0.62) across the 10 cross-validation folds and also the smallest mean error (RMSE = 1.09). It can therefore be stated that the SVM model fits the training data better than the KNN, MARS, and NNET models.
set.seed(525)
summary(resamples(list(knn = knnModel, svm = svmModel, mars = marsModel, nnet = nnetModel)))
##
## Call:
## summary.resamples(object = resamples(list(knn = knnModel, svm = svmModel,
## mars = marsModel, nnet = nnetModel)))
##
## Models: knn, svm, mars, nnet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.8205714 0.9127688 0.9567458 1.0300152 1.1076209 1.412714 0
## svm 0.7375941 0.7761412 0.8722630 0.8801737 0.9182906 1.227994 0
## mars 0.5347481 0.8252692 0.9761031 0.9373961 1.0232285 1.396016 0
## nnet 1.0374207 1.1749932 1.2837866 1.3148196 1.3825707 1.829582 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.9899123 1.1425919 1.189319 1.276353 1.437678 1.758019 0
## svm 0.9343881 0.9958915 1.004286 1.088150 1.131162 1.588976 0
## mars 0.7098012 1.0504415 1.211952 1.172784 1.304896 1.617148 0
## nnet 1.3375485 1.5508655 1.580553 1.664697 1.841105 2.070316 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.2067552 0.4382349 0.5377369 0.5232341 0.6015175 0.7757001 0
## svm 0.3414645 0.5593522 0.6423880 0.6237850 0.7208777 0.8181331 0
## mars 0.3395894 0.5019244 0.5806932 0.5944080 0.6802027 0.8594222 0
## nnet 0.2113228 0.2890505 0.3830814 0.4227157 0.5513093 0.6953579 0
Now, let’s use each fitted model to make predictions on the test-set predictors and compare the accuracy against the actual test responses.
# Helper: compute test-set performance (RMSE, R-squared, MAE) for each fitted model
accuracy = function(models, predictors, response){
acc = list()
i = 1
for (model in models){
predictions = predict(model, newdata = predictors)
acc[[i]] = postResample(pred = predictions, obs = response)
i = i + 1
}
names(acc) = c("knn", "svm", "mars", "nnet")
return(acc)
}
models = list(knnModel, svmModel, marsModel, nnetModel)
accuracy(models, test.p, test.r)
## $knn
## RMSE Rsquared MAE
## 1.1946539 0.6152055 1.0178750
##
## $svm
## RMSE Rsquared MAE
## 0.9082351 0.7625658 0.7875064
##
## $mars
## RMSE Rsquared MAE
## 1.2124432 0.5938139 1.0157364
##
## $nnet
## RMSE Rsquared MAE
## 2.0270316 0.2110275 1.7287088
From the results, it can be concluded that the SVM model predicted the test response with the best accuracy. It has \(R^2\) = 0.76 and RMSE = 0.91.
Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
The top 20 most important variables out of 47 are ranked below. Because the radial SVM has no built-in importance measure, caret::varImp quantifies the relationship between each predictor and the outcome with a loess smoother and returns the \(R^2\) of that fit, against an intercept-only null model, as a relative measure of variable importance. The most influential variable is ManufacturingProcess32, and the Manufacturing Process variables clearly dominate the list.
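The chunk that produced the listing is not shown; it was presumably a call along these lines, and defining svm.v here also lets the plotting code further below reuse it:
## Loess R^2 (model-free) importance for the radial SVM fit
svm.v = varImp(svmModel)
svm.v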
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 47)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 99.89
## BiologicalMaterial06 86.57
## ManufacturingProcess17 85.28
## ManufacturingProcess09 77.19
## BiologicalMaterial03 76.20
## ManufacturingProcess36 68.01
## ManufacturingProcess06 64.79
## ManufacturingProcess33 53.20
## ManufacturingProcess02 47.82
## BiologicalMaterial11 47.81
## ManufacturingProcess12 45.80
## ManufacturingProcess11 43.61
## BiologicalMaterial08 42.44
## BiologicalMaterial01 41.31
## ManufacturingProcess30 35.22
## BiologicalMaterial09 30.69
## ManufacturingProcess28 23.14
## ManufacturingProcess20 22.38
## BiologicalMaterial10 22.03
In the earlier exercise it was the elastic net linear model that best fitted the data, and it retained only 15 of the 47 predictors. While the Manufacturing Process variables still dominate the list, their ranking differs from that of the nonlinear model.
# Elastic Net Regression
elastic.model = train(x = train.p, y = train.r, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneLength = 10, metric = "Rsquared"
)
## glmnet variable importance
##
## only 20 most important variables shown (out of 47)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess09 46.184
## ManufacturingProcess17 37.704
## ManufacturingProcess13 30.138
## ManufacturingProcess06 24.135
## ManufacturingProcess34 17.305
## BiologicalMaterial06 17.267
## ManufacturingProcess36 14.035
## ManufacturingProcess39 13.275
## ManufacturingProcess37 11.931
## ManufacturingProcess30 8.748
## BiologicalMaterial03 8.174
## ManufacturingProcess07 6.940
## BiologicalMaterial05 5.812
## ManufacturingProcess15 2.395
## BiologicalMaterial07 0.000
## BiologicalMaterial09 0.000
## BiologicalMaterial08 0.000
## ManufacturingProcess38 0.000
## ManufacturingProcess28 0.000
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
From the plots below, it is apparent that each manufacturing process and biological material variable has a different relationship with yield. Looking at the slope of the fitted regression line, increasing some features will most likely increase the yield (e.g., ManufacturingProcess32 and BiologicalMaterial06), while increasing others will decrease it (e.g., ManufacturingProcess13).
# Pull the ten most important predictors from the SVM importance object
temp = svm.v$importance
temp$predictor = row.names(temp)
temp = temp[order(temp$Overall, decreasing = TRUE),]
variables = row.names(temp[1:10,])
# Plot each top predictor against Yield with a simple linear fit
par(mfrow = c(5,2))
for (i in 1:10){
x = ChemicalManufacturingProcess[, variables[i]]
y = ChemicalManufacturingProcess$Yield
plot(x, y, xlab = variables[i], ylab = 'Yield')
abline(lm(y~x))
}
## [,1]
## ManufacturingProcess32 0.5892293
## ManufacturingProcess13 -0.5165083
## BiologicalMaterial06 0.4873929
## ManufacturingProcess17 -0.4397274
## ManufacturingProcess09 0.5078655
## BiologicalMaterial03 0.4422623
## ManufacturingProcess36 -0.4859269
## ManufacturingProcess06 0.4222770
## ManufacturingProcess33 0.4297605
## ManufacturingProcess02 -0.2011752
Among the positive relationships, ManufacturingProcess32 has the strongest positive correlation with yield (about 0.59) of any variable in the list: on average, the yield increases as ManufacturingProcess32 increases. Among the negative relationships, ManufacturingProcess13 has the strongest negative correlation (about -0.52): on average, the yield decreases as ManufacturingProcess13 increases. A sketch of how the correlation column above could be computed follows.
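For reference, a minimal sketch that could reproduce a correlation column like the one printed above, assuming the variables vector from the plotting loop and using pairwise-complete observations on the raw data (the exact options used originally are not shown):
## Pearson correlation between each of the top-ten predictors and Yield
cor(ChemicalManufacturingProcess[, variables],
ChemicalManufacturingProcess$Yield,
use = "pairwise.complete.obs")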