Chapter 7, Kuhn and Johnson, Applied Predictive Modeling

Exercise 7.2

(a) Create the data

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 sin(\pi x_1x_2) + 20(x_3 − 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

#library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

(b) Tune models

Tune several models on these data.

For example:

knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to bet the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Tuning a Neural Network model

tooHigh <- findCorrelation(cor(trainingData$x), cutoff = 0.75)

nnetGrid <- expand.grid(decay = c(0.0, 0.01, 0.1),
                        size = c(1:10)
                        )

ctrl <- trainControl(method = "cv")

set.seed(100)
nnetTune <- train(trainingData$x,
                  trainingData$y,
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                  maxit = 500
                  )

nnetTune
## Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE      Rsquared   MAE     
##   0.00    1    2.540546  0.7254252  2.008197
##   0.00    2    2.655140  0.7062546  2.133980
##   0.00    3    2.308948  0.7717065  1.813897
##   0.00    4    2.268677  0.8065299  1.791016
##   0.00    5    2.491449  0.7556790  1.938877
##   0.00    6    3.445760  0.6172989  2.291658
##   0.00    7    5.259894  0.5137135  3.140884
##   0.00    8    5.096729  0.4494295  3.299397
##   0.00    9    6.724966  0.5040399  4.091065
##   0.00   10    3.529274  0.6008843  2.765820
##   0.01    1    2.385136  0.7603460  1.887587
##   0.01    2    2.583767  0.7260485  2.018814
##   0.01    3    2.282621  0.7815267  1.812073
##   0.01    4    2.274770  0.7901402  1.842674
##   0.01    5    2.653199  0.7237241  2.117160
##   0.01    6    2.753791  0.7153836  2.190768
##   0.01    7    2.799525  0.7123252  2.209083
##   0.01    8    3.342579  0.6050358  2.630422
##   0.01    9    3.795128  0.6115537  2.873952
##   0.01   10    3.453153  0.6008848  2.824870
##   0.10    1    2.394058  0.7596252  1.894319
##   0.10    2    2.618767  0.7152952  2.117662
##   0.10    3    2.580239  0.7353788  2.005527
##   0.10    4    2.442308  0.7777448  1.907659
##   0.10    5    2.617403  0.7227439  2.059628
##   0.10    6    2.543814  0.7549811  2.060699
##   0.10    7    2.811804  0.6959408  2.149668
##   0.10    8    2.900555  0.6937553  2.336332
##   0.10    9    3.101142  0.6579730  2.481152
##   0.10   10    2.973902  0.6608165  2.360836
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4 and decay = 0.
nnetPred <- predict(nnetTune, newdata = testData$x)
postResample(pred = nnetPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.4700280 0.7762282 1.9078363

Tuning a Multivariate Adaptive Regression Splines (MARS) model

Create a MARS grid, assigning the degrees and number of pruning parameters, then use it to tune a model.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned <- train(trainingData$x, 
                   trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = ctrl)
summary(marsTuned)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
##             nprune=13)
## 
##                                 coefficients
## (Intercept)                        22.050690
## h(0.621722-X1)                    -15.001651
## h(X1-0.621722)                     10.878737
## h(0.601063-X2)                    -18.830135
## h(0.447442-X3)                      9.940077
## h(X3-0.606015)                     12.999390
## h(0.734892-X4)                     -9.877554
## h(X4-0.734892)                     10.414930
## h(0.850094-X5)                     -5.604897
## h(X1-0.621722) * h(X2-0.295997)   -43.245766
## h(0.649253-X1) * h(0.601063-X2)    26.218297
## 
## Selected 11 of 18 terms, and 5 of 10 predictors (nprune=13)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 8 2
## GCV 1.747495    RSS 264.5358    GRSq 0.929051    RSq 0.9457576

Examine the importance of the variables in the model.

marsPred <- predict(marsTuned, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

Tuning a Support Vector Machine (SVM) model

svmTuned <- train(trainingData$x,
                  trainingData$y,
                  method = "svmRadial",
                  preProcess = c("center","scale"),
                  tuneLength = 14,
                  trControl = ctrl)

svmTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.506264  0.8036998  1.986174
##      0.50  2.279290  0.8140317  1.786308
##      1.00  2.116618  0.8349158  1.659155
##      2.00  1.997289  0.8496435  1.547397
##      4.00  1.923603  0.8598428  1.494257
##      8.00  1.864844  0.8675510  1.468198
##     16.00  1.864796  0.8672641  1.472051
##     32.00  1.864322  0.8673259  1.471146
##     64.00  1.864322  0.8673259  1.471146
##    128.00  1.864322  0.8673259  1.471146
##    256.00  1.864322  0.8673259  1.471146
##    512.00  1.864322  0.8673259  1.471146
##   1024.00  1.864322  0.8673259  1.471146
##   2048.00  1.864322  0.8673259  1.471146
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0627191
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0627191 and C = 32.
svmTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 32 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0627190975351777 
## 
## Number of Support Vectors : 152 
## 
## Objective Function Value : -73.9918 
## Training error : 0.008495
svmPred <- predict(svmTuned, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0731179 0.8257494 1.5747339

(c) Evaluate the models

Which models appear to give the best performance?

The MARS model gives the best performance based on having the lowest RMSE and the highest Rsquared.

rbind(
  "knn" = postResample(pred = knnPred, obs = testData$y),
  "net" = postResample(pred = nnetPred, obs = testData$y),
  "svm" = postResample(pred = svmPred, obs = testData$y),
  "mars" = postResample(pred = marsPred, obs = testData$y)
)
##          RMSE  Rsquared      MAE
## knn  3.204059 0.6819919 2.568346
## net  2.470028 0.7762282 1.907836
## svm  2.073118 0.8257494 1.574734
## mars 1.280306 0.9335241 1.016867

Does MARS select the informative predictors (those named X1–X5)?

MARS does select the informative predictors named X1–X5 although X3 has an importance very close to zero.

varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.33
## X2   48.88
## X5   15.63
## X3    0.00

Exercise 7.5

(0) Chemical manufacturing revisited

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

First acquire the data.

data("ChemicalManufacturingProcess")
chem_mfg <- ChemicalManufacturingProcess

Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter.

Transform, impute missing values, center and scale the data, then check for low frequency values and remove them. Only one variable is removed

set.seed(404)

preProc <- preProcess(chem_mfg, method = c("BoxCox","knnImpute","center","scale"))

chem_pred <- predict(preProc,chem_mfg)

# identify predictors with low frequencies
lowFreqPredictors <- nearZeroVar(chem_pred)

# remove the above set from the data and store the result in a dataframe
chem_pred <- chem_pred[,-lowFreqPredictors]

trainingRows <- sample(c(rep(0, 0.75 * nrow(chem_pred)), 
                  rep(1, 0.25 * nrow(chem_pred))))

chem_train <- chem_pred[trainingRows == 0,]
chem_test <- chem_pred[trainingRows == 1,]

With the data split, train each of the model types from the chapter, first KNN is trained.

trainX <- chem_train %>% select(-Yield) 
trainY <- chem_train$Yield

testX <- chem_test %>% select(-Yield)
testY <- chem_test$Yield

knnModel2 <- train(x = trainX,
                  y = trainY,
                  method = "knn",
                  preProcess = c("center", "scale"),
                  tuneLength = 10
                  )

knnPred2 <- predict(knnModel2, newdata = testX)
knnModel2
## k-Nearest Neighbors 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    5  0.7849372  0.3512194  0.6254559
##    7  0.7686649  0.3681700  0.6220105
##    9  0.7665498  0.3708665  0.6250032
##   11  0.7777691  0.3562217  0.6417412
##   13  0.7823360  0.3568635  0.6446992
##   15  0.7835197  0.3631654  0.6498450
##   17  0.7830489  0.3712792  0.6512016
##   19  0.7833626  0.3758749  0.6528065
##   21  0.7855431  0.3789807  0.6541842
##   23  0.7851436  0.3860551  0.6546798
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

Next Neural Network is trained.

tooHigh2 <- findCorrelation(cor(trainX), cutoff = 0.75)

trainXnnet <- trainX[,-tooHigh2]
  
nnetGrid2 <- expand.grid(decay = c(0.0, 0.01, 0.1),
                        size = c(1:6)
                        )

set.seed(101)
nnetTune2 <- train(trainX,
                  trainY,
                  method = "nnet",
                  tuneGrid = nnetGrid2,
                  trControl = ctrl,
                  preProc = c("center","scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(trainXnnet) + 1) + 10 + 1,
                  maxit = 500
                  )

nnetTune2
## Neural Network 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 120, 119, 118, 119, 118, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared   MAE      
##   0.00   1     0.9344072  0.3326020  0.7414160
##   0.00   2     0.8590479  0.3602825  0.6682005
##   0.00   3     0.9311655  0.3109729  0.7343087
##   0.00   4     1.0177424  0.3439668  0.8293547
##   0.00   5     0.9824927  0.3485994  0.7932048
##   0.00   6     0.9228106  0.3484658  0.7276560
##   0.01   1     0.9008567  0.3092818  0.6912542
##   0.01   2     1.1299205  0.2780768  0.8949143
##   0.01   3     1.0227262  0.3137698  0.8256810
##   0.01   4     0.8114177  0.4690071  0.6618649
##   0.01   5     0.8006894  0.4558987  0.6284714
##   0.01   6     0.7503117  0.5000343  0.6063890
##   0.10   1     0.7492391  0.4607657  0.5871559
##   0.10   2     0.7695902  0.4989661  0.6057562
##   0.10   3     0.8649446  0.4472045  0.6714006
##   0.10   4     0.8129531  0.4691542  0.6418103
##   0.10   5     0.7667720  0.4785186  0.6163900
##   0.10   6     0.6703694  0.5724950  0.5478685
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 6 and decay = 0.1.

Next MARS is trained.

marsGrid2 <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsTuned2 <- train(trainX, 
                   trainY,
                   method = "earth",
                   tuneGrid = marsGrid2,
                   trControl = ctrl)
summary(marsTuned2)
## Call: earth(x=data.frame[132,56], y=c(-1.216,1.21,1...), keepxy=TRUE, degree=1,
##             nprune=3)
## 
##                                     coefficients
## (Intercept)                           -0.5194954
## h(0.998218-ManufacturingProcess11)    -0.3726384
## h(ManufacturingProcess32- -1.21828)    0.7018203
## 
## Selected 3 of 21 terms, and 2 of 56 predictors (nprune=3)
## Termination condition: RSq changed by less than 0.001 at 21 terms
## Importance: ManufacturingProcess32, ManufacturingProcess11, ...
## Number of terms at each degree of interaction: 1 2 (additive model)
## GCV 0.4063841    RSS 49.65583    GRSq 0.5296448    RSq 0.5579302

Last SVM is trained.

svmTuned2 <- train(trainX,
                  trainY,
                  method = "svmRadial",
                  preProcess = c("center","scale"),
                  tuneLength = 10,
                  trControl = ctrl)

svmTuned2
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 132 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 119, 119, 118, 119, 119, 120, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE      
##     0.25  0.7057397  0.5318341  0.5954018
##     0.50  0.6602977  0.5603907  0.5507888
##     1.00  0.6128876  0.6116305  0.5045976
##     2.00  0.5874948  0.6367443  0.4753599
##     4.00  0.5758092  0.6468031  0.4714576
##     8.00  0.5708086  0.6508088  0.4719308
##    16.00  0.5666914  0.6557679  0.4692121
##    32.00  0.5666914  0.6557679  0.4692121
##    64.00  0.5666914  0.6557679  0.4692121
##   128.00  0.5666914  0.6557679  0.4692121
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01166393
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01166393 and C = 16.

What is the optimal value of the performance metric?

Overall the optimal value of the performance metric is \(R^2 = 0.9980564\) in the Neural Network model.

(a) Assess the process

Which nonlinear regression model gives the optimal resampling and test set performance?

The Neural Net model gives the best performance based on having the lowest RMSE and the highest Rsquared.

rbind(
  "nnet" = postResample(pred = predict(nnetTune2), obs = trainY),
  "knn" = postResample(pred = predict(knnModel2), obs = trainY),
  "mars" = postResample(pred = predict(marsTuned2), obs = trainY),
  "svm" = postResample(pred = predict(svmTuned2), obs = trainY)
)
##            RMSE  Rsquared        MAE
## nnet 0.04288538 0.9980564 0.03299418
## knn  0.62525567 0.5753418 0.51608229
## mars 0.61333555 0.5579302 0.49940287
## svm  0.08804417 0.9934878 0.08580215

(b) Assess the predictors

Which predictors are most important in the optimal nonlinear regression model?

In the SVM model the top five most important predictors are ManufacturingProcesses 32 and 36, and BiologicalMaterial 06, 03 and 01.

varImp(svmTuned2)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     68.15
## BiologicalMaterial03     57.69
## ManufacturingProcess36   55.55
## BiologicalMaterial01     50.58
## ManufacturingProcess31   50.07
## BiologicalMaterial02     49.51
## ManufacturingProcess29   46.93
## BiologicalMaterial04     45.54
## ManufacturingProcess33   45.48
## ManufacturingProcess13   45.39
## BiologicalMaterial12     44.54
## ManufacturingProcess02   43.34
## ManufacturingProcess09   40.54
## ManufacturingProcess06   36.00
## BiologicalMaterial08     31.46
## ManufacturingProcess17   29.54
## ManufacturingProcess26   29.03
## BiologicalMaterial11     28.94
## ManufacturingProcess11   28.50

Do either the biological or process variables dominate the list?

With only 8 biological variables but 12 process variables, the process variables edge out the biological ones for dominance.

How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Looking at the top 20 predictors from the other models, the Neural Net model’s top 20 variables show only 5 biological and 15 process variables that dominate the list.

varImp(nnetTune2)
## nnet variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess23  100.00
## ManufacturingProcess32   96.43
## ManufacturingProcess34   92.81
## ManufacturingProcess33   84.75
## BiologicalMaterial09     82.03
## ManufacturingProcess01   80.49
## ManufacturingProcess03   80.04
## BiologicalMaterial11     78.73
## ManufacturingProcess28   76.03
## ManufacturingProcess45   73.56
## ManufacturingProcess36   72.53
## ManufacturingProcess06   71.24
## BiologicalMaterial12     70.66
## ManufacturingProcess17   70.64
## ManufacturingProcess37   70.00
## ManufacturingProcess41   69.68
## ManufacturingProcess08   69.22
## BiologicalMaterial10     68.23
## BiologicalMaterial04     67.77
## ManufacturingProcess22   67.55

The MARS model shows 2 process variables dominating the top slots.

varImp(marsTuned2)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32     100
## ManufacturingProcess11       0

The KNN model matches the SVM model with 8 biological and 12 process variables dominating the list.

varImp(knnModel2)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## BiologicalMaterial06     68.15
## BiologicalMaterial03     57.69
## ManufacturingProcess36   55.55
## BiologicalMaterial01     50.58
## ManufacturingProcess31   50.07
## BiologicalMaterial02     49.51
## ManufacturingProcess29   46.93
## BiologicalMaterial04     45.54
## ManufacturingProcess33   45.48
## ManufacturingProcess13   45.39
## BiologicalMaterial12     44.54
## ManufacturingProcess02   43.34
## ManufacturingProcess09   40.54
## ManufacturingProcess06   36.00
## BiologicalMaterial08     31.46
## ManufacturingProcess17   29.54
## ManufacturingProcess26   29.03
## BiologicalMaterial11     28.94
## ManufacturingProcess11   28.50

(c) Assess the results

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Plot the top 10 predictors for the Neural Network model as identified above. Then make a correlation plot to Yield.

The correlation plot shows that ManufacturingProcesses 32, 34, 33 and 28 appear to have positive correlations with Yield, while ManufacturingProcesses 23, 01, 03 and 45 appear to have a negative correlation with Yield. That said, both BiologicalMaterials 09 and 11 in the top 10 predictors appear to have a positive correlation with Yield.

top10X <- chem_pred %>%
  select(c("ManufacturingProcess23","ManufacturingProcess32","ManufacturingProcess34","ManufacturingProcess33","BiologicalMaterial09","ManufacturingProcess01","ManufacturingProcess03","BiologicalMaterial11","ManufacturingProcess28","ManufacturingProcess45","Yield"))

correlation <- cor(top10X)

corrplot(correlation)