Applied Predictive Modeling (Kuhn & Johnson)

Non-Linear Regression Models

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \] where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
library(caret)
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
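
The same kind of data can also be written out directly from the equation. The following is a minimal base R sketch, assuming ten independent Uniform(0, 1) predictors of which only the first five enter the response. The helper name simulate_friedman1 is hypothetical, and the sketch is not claimed to match the internals of mlbench.friedman1.

```r
# Sketch: generate Friedman-style data straight from the equation.
# simulate_friedman1 is a hypothetical helper, not part of mlbench.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # X1-X10 ~ Uniform(0, 1)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]                            # X6-X10 never enter the response
  list(x = data.frame(x), y = signal + rnorm(n, sd = sd))
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)   # rows and columns of the simulated predictor set
```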

# Tune several models on these data. For example:

library(caret)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.565620  0.4887976  2.886629
##    7  3.422420  0.5300524  2.752964
##    9  3.368072  0.5536927  2.715310
##   11  3.323010  0.5779056  2.669375
##   13  3.275835  0.6030846  2.628663
##   15  3.261864  0.6163510  2.621192
##   17  3.261973  0.6267032  2.616956
##   19  3.286299  0.6281075  2.640585
##   21  3.280950  0.6390386  2.643807
##   23  3.292397  0.6440392  2.656080
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.

knnPred <- predict(knnModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values
knnperf <- postResample(pred = knnPred, obs = testData$y)
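
For reference, the three values returned by postResample can be reproduced by hand. The sketch below uses made-up vectors and assumes that Rsquared is the squared Pearson correlation between predictions and observations, which is how caret's default regression summary is understood to compute it.

```r
# Sketch of the three metrics reported by postResample.
# Toy vectors; the values are illustrative only.
obs  <- c(1.5, 2.0, 2.5, 5.0)
pred <- c(1.0, 2.0, 3.0, 4.0)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared correlation (assumed definition)
mae  <- mean(abs(pred - obs))        # mean absolute error

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 4)
```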

# Neural Networks

nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1),
                        .size=c(1,5,10),
                        .bag=FALSE)

nnetModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  preProc = c("center", "scale"),
                  trace=FALSE,
                  linout=TRUE,
                  maxit=500)

## Warning: executing %dopar% sequentially: no parallel backend registered

nnetModel

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.633298  0.7311434  2.062484
##   0.00    5    3.285459  0.6470466  2.379608
##   0.00   10    2.903921  0.6844344  2.274391
##   0.01    1    2.597128  0.7419913  2.019047
##   0.01    5    2.536430  0.7544127  1.999224
##   0.01   10    2.753848  0.7126363  2.175934
##   0.10    1    2.595957  0.7411858  2.013476
##   0.10    5    2.485368  0.7622012  1.972273
##   0.10   10    2.513428  0.7562287  1.989455
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0.1 and bag
##  = FALSE.

nnetPred <- predict(nnetModel, newdata = testData$x)
nnetperf <- postResample(pred = nnetPred, obs = testData$y)

# MARS

library(earth)
marsGrid <- expand.grid(.degree=1:2,
                        .nprune=2:20)

marsModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  preProc = c("center", "scale"))
marsModel

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.459160  0.2269009  3.648621
##   1        3      3.703808  0.4627759  2.998624
##   1        4      2.788330  0.6952911  2.247361
##   1        5      2.552061  0.7438976  2.038840
##   1        6      2.398227  0.7737511  1.917525
##   1        7      1.956515  0.8489167  1.536833
##   1        8      1.859780  0.8631867  1.447584
##   1        9      1.768654  0.8765206  1.374287
##   1       10      1.764931  0.8775507  1.356467
##   1       11      1.779741  0.8766418  1.376192
##   1       12      1.774808  0.8772249  1.372218
##   1       13      1.805089  0.8726829  1.397210
##   1       14      1.819615  0.8711360  1.406696
##   1       15      1.835221  0.8695754  1.416871
##   1       16      1.840524  0.8687417  1.422030
##   1       17      1.842401  0.8683960  1.425353
##   1       18      1.842401  0.8683960  1.425353
##   1       19      1.842401  0.8683960  1.425353
##   1       20      1.842401  0.8683960  1.425353
##   2        2      4.471734  0.2244806  3.647081
##   2        3      3.714218  0.4599572  3.004844
##   2        4      2.861317  0.6777013  2.315312
##   2        5      2.553105  0.7439500  2.051875
##   2        6      2.446188  0.7645488  1.949440
##   2        7      2.053872  0.8319061  1.614748
##   2        8      1.861883  0.8626461  1.455725
##   2        9      1.730611  0.8802498  1.353077
##   2       10      1.600061  0.8971990  1.254381
##   2       11      1.511413  0.9084055  1.189547
##   2       12      1.542350  0.9052716  1.187687
##   2       13      1.509975  0.9103798  1.163212
##   2       14      1.467450  0.9149851  1.139031
##   2       15      1.475360  0.9139041  1.147567
##   2       16      1.490228  0.9115933  1.145509
##   2       17      1.492258  0.9109130  1.142455
##   2       18      1.490088  0.9110852  1.139128
##   2       19      1.489183  0.9112096  1.138835
##   2       20      1.489997  0.9110402  1.139272
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

marsPred <- predict(marsModel, newdata = testData$x)
marsperf <- postResample(pred = marsPred, obs = testData$y)

# SVM
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
svmModel

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.598750  0.7792058  2.072381
##     0.50  2.377193  0.7917602  1.885072
##     1.00  2.238917  0.8081062  1.765140
##     2.00  2.168225  0.8184332  1.700831
##     4.00  2.136811  0.8225716  1.669194
##     8.00  2.132541  0.8229563  1.666091
##    16.00  2.133316  0.8228263  1.666654
##    32.00  2.133316  0.8228263  1.666654
##    64.00  2.133316  0.8228263  1.666654
##   128.00  2.133316  0.8228263  1.666654
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06773352
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06773352 and C = 8.

svmPred <- predict(svmModel, newdata = testData$x)
svmperf <- postResample(pred = svmPred, obs = testData$y)

library(knitr)
kable(data.frame("KNN"=knnperf["RMSE"], "NNET"=nnetperf["RMSE"], "MARS"=marsperf["RMSE"], "SVM"=svmperf["RMSE"]))

	KNN	NNET	MARS	SVM
RMSE	3.175066	2.170368	1.277999	2.075607

varImp(marsModel)

## earth variable importance
## 
##     Overall
## X1   100.00
## X4    84.98
## X2    68.87
## X5    48.55
## X3    38.96
## X7     0.00
## X9     0.00
## X8     0.00
## X10    0.00
## X6     0.00

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1-X5)?

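The MARS model gave the best test-set performance of the four models, with an RMSE of 1.278 (SVM: 2.076, neural network: 2.170, KNN: 3.175). The variable importance output confirms that MARS selected only the informative predictors X1-X5; the non-informative predictors X6-X10 all have an importance of 0.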
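
For context, the simulation adds noise with sd = 1, so no model can be expected to reach a test RMSE much below 1. The sketch below checks this noise floor by scoring the true mean function on freshly simulated data. It assumes the equation given at the start of the exercise and uses base R only; true_mean is a hypothetical helper.

```r
# Sketch: RMSE of the true mean function, i.e. the best any model could do.
# true_mean is a hypothetical helper encoding the simulation equation.
true_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5]
}

set.seed(1)
n <- 5000
x <- matrix(runif(n * 10), ncol = 10)
y <- true_mean(x) + rnorm(n, sd = 1)

oracle_rmse <- sqrt(mean((y - true_mean(x))^2))
oracle_rmse   # close to the noise sd of 1
```

Against that floor, a test RMSE of 1.278 for MARS is not far from the best achievable on these data.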

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process.

(A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.)

Use the same data imputation, data splitting and pre-processing steps as before and train several nonlinear regression models.

Data loading, exploration, splitting & pre-processing/imputation

#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

library(caret)

chmp <- ChemicalManufacturingProcess
predictors <- chmp[,-1]

# Correlations - Find most correlated predictors

corr <- cor(predictors, use='complete.obs')
topcorr <- findCorrelation(corr) 

# Zero to Near-zero variance predictors check

nzv <- nearZeroVar(predictors)

# Final predictors to be considered for modeling (minus top correlated and near-zero variance)

predictors <- predictors[,-c(nzv, topcorr)]
yield <- as.data.frame(chmp[,1])
colnames(yield) <- c("yield")

# Splitting Train and Test datasets

library(caret)

set.seed(500)
train <- createDataPartition(yield$yield, p = 0.7, list = FALSE)

trainPredictors <- predictors[train,]
trainYield <- yield[train,]

testPredictors <- predictors[-train,]
testYield <- yield[-train,]

# Pre-processing

transtrain <- preProcess(trainPredictors, method=c("BoxCox","center","scale", "knnImpute"))
transtest <- preProcess(testPredictors, method=c("BoxCox","center","scale", "knnImpute"))
transTrainPredictors <- predict(transtrain,trainPredictors)
transTestPredictors <- predict(transtest,testPredictors)
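
Note that the code above estimates a second set of transformations from the test predictors. The more common practice is to estimate the transformations on the training set only and apply them to both sets, which in caret would be predict(transtrain, testPredictors). The base R sketch below illustrates the idea for centering and scaling only, on made-up data; it is not the code that produced the results reported in this document.

```r
# Sketch: estimate centering/scaling on the training set, reuse on the test set.
# The data are simulated for illustration.
set.seed(500)
train_x <- data.frame(a = rnorm(100, mean = 10, sd = 2),
                      b = rnorm(100, mean = -3, sd = 5))
test_x  <- data.frame(a = rnorm(40, mean = 10, sd = 2),
                      b = rnorm(40, mean = -3, sd = 5))

# Estimate the constants on the training set only
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the same constants to both sets
train_z <- scale(train_x, center = centers, scale = scales)
test_z  <- scale(test_x,  center = centers, scale = scales)

colMeans(train_z)  # zero up to rounding
colMeans(test_z)   # near zero, but not exactly
```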

Modeling & Validation

Linear Model (PLS) - For Reference

ctrl <- trainControl(method = "boot", number = 25)
pls_tune <- train(x = transTrainPredictors, y = trainYield,
                 method = "pls",
                 tuneLength = 15,
                 trControl = ctrl)
pls_tune

## Partial Least Squares 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     1.351010  0.4485164  1.102844
##    2     1.260323  0.5154151  1.016047
##    3     1.273289  0.5153561  1.028123
##    4     1.309761  0.4989806  1.059869
##    5     1.347540  0.4830462  1.084036
##    6     1.382643  0.4660665  1.108433
##    7     1.426869  0.4437381  1.147465
##    8     1.466832  0.4273872  1.179911
##    9     1.522083  0.4010375  1.228305
##   10     1.569062  0.3827168  1.264325
##   11     1.621665  0.3653264  1.296599
##   12     1.659835  0.3524162  1.318829
##   13     1.703314  0.3384623  1.343085
##   14     1.744742  0.3275454  1.364165
##   15     1.794207  0.3150951  1.391282
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.

plsTest <- data.frame(obs=testYield,pred=predict(pls_tune,transTestPredictors))
plsperf <- defaultSummary(plsTest)

# Important Predictors

plsImp <- varImp(pls_tune, scale = FALSE)
plot(plsImp, top=10)
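
The trainControl(method = "boot", number = 25) setting requests 25 bootstrap resamples. A minimal base R sketch of the idea follows, assuming each model is fit on a sample drawn with replacement and scored on the rows left out of that sample. Simulated data and a plain lm fit stand in for the PLS model here.

```r
# Sketch: bootstrap resampling of RMSE with out-of-bag evaluation.
# Simulated data; lm is a stand-in for the tuned model.
set.seed(25)
n <- 124
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)

boot_rmse <- replicate(25, {
  in_bag  <- sample(n, replace = TRUE)       # bootstrap sample of size n
  out_bag <- setdiff(seq_len(n), in_bag)     # rows that were not drawn
  fit  <- lm(y ~ x1 + x2, data = dat[in_bag, ])
  pred <- predict(fit, newdata = dat[out_bag, ])
  sqrt(mean((dat$y[out_bag] - pred)^2))
})

mean(boot_rmse)   # resampled RMSE estimate
```

Averaging over the 25 resamples is what produces one RMSE row per tuning value in the tables above.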

Nonlinear Models (KNN, Neural Networks, MARS & SVM)

library(caret)
knnModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "knn",
                  tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.402749  0.4048961  1.100072
##    7  1.381977  0.4197026  1.104513
##    9  1.382696  0.4216603  1.117713
##   11  1.377755  0.4270218  1.118546
##   13  1.392717  0.4211626  1.132558
##   15  1.393130  0.4275926  1.131653
##   17  1.399322  0.4260907  1.142817
##   19  1.409851  0.4216393  1.156186
##   21  1.418607  0.4178667  1.168636
##   23  1.420105  0.4270186  1.170369
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.

# Neural Networks

nnetGrid <- expand.grid(.decay=c(0, 0.01, 0.1),
                        .size=c(1,5,10),
                        .bag=FALSE)

nnetModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trace=FALSE,
                  linout=TRUE,
                  maxit=500)
nnetModel

## Model Averaged Neural Network 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    1.610693  0.3282347  1.317989
##   0.00    5    2.096759  0.2857749  1.662703
##   0.00   10    7.453411  0.1163407  4.777774
##   0.01    1    1.632921  0.3537768  1.287717
##   0.01    5    2.382653  0.2224624  1.692331
##   0.01   10    2.164522  0.3030359  1.667056
##   0.10    1    1.935140  0.3037360  1.474595
##   0.10    5    2.680296  0.2035901  1.783633
##   0.10   10    2.267248  0.2250593  1.679761
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0 and bag
##  = FALSE.

# MARS

library(earth)
marsGrid <- expand.grid(.degree=1:2,
                        .nprune=2:10)

marsModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "earth",
                  tuneGrid = marsGrid)
marsModel

## Multivariate Adaptive Regression Spline 
## 
## 124 samples
##  47 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      1.474974  0.3937771  1.183436
##   1        3      1.347717  0.4944946  1.077682
##   1        4      1.358573  0.4855082  1.088398
##   1        5      1.352871  0.4895013  1.089438
##   1        6      1.464143  0.4679334  1.121022
##   1        7      1.608554  0.4405143  1.168835
##   1        8      1.597393  0.4333546  1.168652
##   1        9      1.673360  0.4178674  1.197718
##   1       10      1.744609  0.4141018  1.224317
##   2        2      1.476052  0.3932149  1.184854
##   2        3      1.404857  0.4529665  1.125711
##   2        4      1.378437  0.4700282  1.096039
##   2        5      1.836718  0.4300026  1.201290
##   2        6      1.885691  0.4048881  1.209516
##   2        7      1.885759  0.3943905  1.221275
##   2        8      1.866741  0.3979609  1.216667
##   2        9      1.609065  0.4016889  1.175496
##   2       10      1.695258  0.3931507  1.214503
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.

# SVM
svmModel <- train(x = transTrainPredictors,
                  y = trainYield,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
svmModel

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 124 samples
##  47 predictor
## 
## Pre-processing: centered (47), scaled (47) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 124, 124, 124, 124, 124, 124, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.403416  0.4964341  1.1574251
##     0.50  1.311130  0.5279193  1.0770035
##     1.00  1.265853  0.5427493  1.0208705
##     2.00  1.259414  0.5441863  0.9918689
##     4.00  1.268521  0.5390152  0.9855121
##     8.00  1.271589  0.5359192  0.9834915
##    16.00  1.271569  0.5358966  0.9835233
##    32.00  1.271569  0.5358966  0.9835233
##    64.00  1.271569  0.5358966  0.9835233
##   128.00  1.271569  0.5358966  0.9835233
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0125532
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0125532 and C = 2.

Which nonlinear regression model gives the optimal resampling and test set performance?

Resampling Performance

resampl <- resamples(list(KNN=knnModel, NNet=nnetModel, MARS=marsModel, SVM=svmModel))
summary(resampl)

## 
## Call:
## summary.resamples(object = resampl)
## 
## Models: KNN, NNet, MARS, SVM 
## Number of resamples: 25 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## KNN  0.9058592 0.9870896 1.1127841 1.1185456 1.203046 1.558048    0
## NNet 1.0495308 1.2273437 1.3294560 1.3179887 1.385199 1.661394    0
## MARS 0.8684310 0.9876964 1.0447313 1.0776815 1.118345 1.522757    0
## SVM  0.7658697 0.9145722 0.9718791 0.9918689 1.070424 1.197459    0
## 
## RMSE 
##          Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## KNN  1.178216 1.234169 1.354816 1.377755 1.463125 1.929067    0
## NNet 1.285416 1.487737 1.629464 1.610693 1.747954 1.935362    0
## MARS 1.116609 1.234721 1.328910 1.347717 1.441420 1.774211    0
## SVM  1.056796 1.161601 1.216338 1.259414 1.352619 1.538968    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## KNN  0.2201409 0.3727362 0.4039894 0.4270218 0.4947506 0.5988764    0
## NNet 0.1031603 0.2770266 0.3469157 0.3282347 0.4008970 0.5310045    0
## MARS 0.3043747 0.4469323 0.4782430 0.4944946 0.5626903 0.6338061    0
## SVM  0.4526292 0.5040173 0.5460373 0.5441863 0.5853110 0.6747305    0

The SVM model has the best resampling performance on all three metrics, with the lowest mean RMSE (1.259) and MAE (0.992) and the highest mean Rsquared (0.544). MARS is second (RMSE 1.348, MAE 1.078, Rsquared 0.494).
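
The ranking can be read off the Mean column of the summary above. A small base R sketch, with those means copied in by hand:

```r
# Mean resampling metrics copied from the summary.resamples output above
resamp_means <- data.frame(
  model    = c("KNN", "NNet", "MARS", "SVM"),
  RMSE     = c(1.377755, 1.610693, 1.347717, 1.259414),
  MAE      = c(1.1185456, 1.3179887, 1.0776815, 0.9918689),
  Rsquared = c(0.4270218, 0.3282347, 0.4944946, 0.5441863),
  stringsAsFactors = FALSE
)

# Sort from best to worst by mean RMSE (lower is better)
resamp_means[order(resamp_means$RMSE), ]

# Best model per metric (lowest RMSE and MAE, highest Rsquared)
best <- c(RMSE     = resamp_means$model[which.min(resamp_means$RMSE)],
          MAE      = resamp_means$model[which.min(resamp_means$MAE)],
          Rsquared = resamp_means$model[which.max(resamp_means$Rsquared)])
best
```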

Test set Performance

knnPred <- predict(knnModel, newdata = transTestPredictors)
knnperf <- postResample(pred = knnPred, obs = testYield)
nnetPred <- predict(nnetModel, newdata = transTestPredictors)
nnetperf <- postResample(pred = nnetPred, obs = testYield)
marsPred <- predict(marsModel, newdata = transTestPredictors)
marsperf <- postResample(pred = marsPred, obs = testYield)
svmPred <- predict(svmModel, newdata = transTestPredictors)
svmperf <- postResample(pred = svmPred, obs = testYield)

library(knitr)
kable(data.frame("KNN"=knnperf, "NNET"=nnetperf, "MARS"=marsperf, "SVM"=svmperf, "PLS"=plsperf))

	KNN	NNET	MARS	SVM	PLS
RMSE	1.5946023	1.7669049	1.5386158	1.2414576	1.3105629
Rsquared	0.3451473	0.2782982	0.3755943	0.6073041	0.5551553
MAE	1.3030944	1.4193041	1.1283177	0.9816494	1.0267944

On the test set, the SVM model performs best across all three metrics (RMSE 1.241, Rsquared 0.607, MAE 0.982). MARS is the next best nonlinear model (RMSE 1.539), but it is worth noting that the linear model (PLS) actually comes second overall (RMSE 1.311), ahead of MARS.
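
To see how well the resampling estimates anticipated the test results, the mean resampled RMSE of each selected model can be set next to its test RMSE. The sketch below uses base R with the values copied from the outputs in this document.

```r
# Resampled (selected tuning value) and test RMSE, copied from the outputs above
perf <- data.frame(
  model      = c("KNN", "NNet", "MARS", "SVM", "PLS"),
  resampling = c(1.377755, 1.610693, 1.347717, 1.259414, 1.260323),
  test       = c(1.5946023, 1.7669049, 1.5386158, 1.2414576, 1.3105629),
  stringsAsFactors = FALSE
)

# Positive gap: the model did worse on the test set than resampling suggested
perf$gap <- perf$test - perf$resampling
perf[order(perf$test), ]
```

Only the SVM has a test RMSE at or below its resampling estimate. With a test set this small, part of the gap for the other models may simply reflect sampling variability.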

Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

svmModel$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 2 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0125531978064151 
## 
## Number of Support Vectors : 113 
## 
## Objective Function Value : -60.9605 
## Training error : 0.106214

svmImp <- varImp(svmModel)
plot(svmImp, top=10)

plot(plsImp, top=10)

The most important predictors for the optimal nonlinear model (SVM) are primarily ManufacturingProcess predictors (7 out of 10). Interestingly, one of the 3 BiologicalMaterial predictors is the second most important for the model. The top 10 for the linear model (PLS) contains the same proportion of ManufacturingProcess predictors (7 out of 10), but there the predominance of the process predictors over the BiologicalMaterial ones is more clearly defined.
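
One caveat when reading the SVM ranking: as far as the caret documentation describes, there is no SVM-specific importance measure, so varImp falls back on a model-independent filter that scores each predictor separately (for regression, by the R-squared of a loess smoother of the outcome on that predictor). Under that assumption, the ranking describes univariate relationships rather than the fitted SVM itself. A base R sketch of such a filter on made-up data, where loess_r2 is a hypothetical helper:

```r
# Sketch of a univariate filter: score each predictor by the R-squared of a
# loess smoother of the outcome on that predictor alone.
# loess_r2 is a hypothetical helper; the data are simulated.
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(1)
n <- 150
x_signal <- runif(n)                      # related to y, but not linearly
x_noise  <- runif(n)                      # unrelated to y
y <- sin(2 * pi * x_signal) + rnorm(n, sd = 0.3)

scores <- c(signal = loess_r2(x_signal, y), noise = loess_r2(x_noise, y))
round(scores, 3)
```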

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Extract the predictors in the SVM top 20 that are not in the PLS top 20.
# The importance scores are stored in a one-column data frame, so the column
# is pulled out as a numeric vector before ordering.

svmtoppred <- rownames(svmImp$importance)[order(abs(svmImp$importance[, 1]), decreasing = TRUE)][1:20]
plstoppred <- rownames(plsImp$importance)[order(abs(plsImp$importance[, 1]), decreasing = TRUE)][1:20]
uniquepred <- setdiff(svmtoppred, plstoppred)

#for (i in 1:length(uniquepred)){
#  plot(x=transTrainPredictors[,uniquepred[i]], y=trainYield, col="blue",xlab = uniquepred[i], ylab = "Yield")
#}

plot(x=transTrainPredictors[,uniquepred[1]], y=trainYield, col="blue",xlab = uniquepred[1], ylab = "Yield")

plot(x=transTrainPredictors[,uniquepred[2]], y=trainYield, col="blue",xlab = uniquepred[2], ylab = "Yield")

plot(x=transTrainPredictors[,uniquepred[3]], y=trainYield, col="blue",xlab = uniquepred[3], ylab = "Yield")

#plot(x=transTrainPredictors[,uniquepred[4]], y=trainYield, col="blue",xlab = uniquepred[4], ylab = "Yield")

There is a large overlap in the top 20 predictors of the nonlinear and linear models: only 4 predictors are unique to the nonlinear model, all of them ManufacturingProcess predictors. Most of them do not show an evident relationship with the Yield response in the scatter plots. This is plausible because the radial kernel SVM implicitly maps the training data into a higher-dimensional space before regressing the corresponding response values, so its fit need not appear as a simple univariate trend.
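
The radial basis kernel named in the ksvm output measures the similarity of two samples as a function of their squared Euclidean distance. A minimal base R sketch follows, assuming the parameterization k(x, x') = exp(-sigma * ||x - x'||^2) that kernlab documents for its rbfdot kernel, with the sigma value reported in the tuning output. The helper rbf_kernel is hypothetical and the two input vectors are made up.

```r
# Sketch of a Gaussian radial basis kernel (assumed kernlab parameterization).
# rbf_kernel is a hypothetical helper; a and b are illustrative vectors.
rbf_kernel <- function(x1, x2, sigma) {
  exp(-sigma * sum((x1 - x2)^2))
}

sigma <- 0.0125532          # value held constant during tuning (see output)
a <- c(0.5, -1.2, 0.3)
b <- c(-0.4, 0.8, 1.1)

rbf_kernel(a, a, sigma)     # identical samples give exactly 1
rbf_kernel(a, b, sigma)     # decays toward 0 as samples move apart
```

The fitted regression function is a weighted sum of such kernel terms over the support vectors, so the contribution of any single predictor depends on the values of all the others.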

HHernandez_DATA624_HW08

humbertohp

April 15, 2020