Exercises from Chapter 7 of the textbook Applied Predictive Modeling by Kuhn & Johnson

Exercise 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)\)

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(mlbench)
set.seed(200)
trainingData = mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x = data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.
plot(trainingData$y, main = "200 Computed Values using Uniformly Distributed Seed Inputs",
     ylab = "Computed Value")

This question takes an interesting approach: it computes 200 values like the ones plotted above from a nonlinear function of uniformly distributed “predictors”, trains models on those 200 values, and then evaluates the models against a much larger test set, which removes the chance outcomes that arise with small test sets. Most of the remaining variance comes from the relatively small training set, where 10-fold CV, for example, averages fits on 180 values evaluated against held-out batches of 20.

set.seed(200)
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData = mlbench.friedman1(5000, sd = 1)
testData$x = data.frame(testData$x)

(a) Tune several models on these data.

k-Nearest Neighbors

# alias to simplify typing
trD = trainingData
teD = testData

Provided example:

set.seed(200)
knnModel <- train(x = trD$x,
                  y = trD$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.654912  0.4779838  2.958475
##    7  3.529432  0.5118581  2.861742
##    9  3.446330  0.5425096  2.780756
##   11  3.378049  0.5723793  2.719410
##   13  3.332339  0.5953773  2.692863
##   15  3.309235  0.6111389  2.663046
##   17  3.317408  0.6201421  2.678898
##   19  3.311667  0.6333800  2.682098
##   21  3.316340  0.6407537  2.688887
##   23  3.326040  0.6491480  2.705915
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
set.seed(200)
knnPred = predict(knnModel, newdata = teD$x)
postResample(pred = knnPred, obs = teD$y)
##     RMSE Rsquared      MAE 
## 5.621993       NA 4.546538

Seeing the example code center and scale the data makes me wonder whether normalizing really helps when the predictors are already uniformly distributed between 0 and 1. Standardizing a Uniform(0, 1) variable doesn’t change the relative spacing of the values at all; it just shifts and stretches them, widening the range from [0, 1] to roughly [-1.73, 1.73] (the sd of a Uniform(0, 1) variable is about 0.29). Let’s see:
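
As a quick sanity check on that intuition, here is a small standalone sketch (one simulated Uniform(0, 1) variable) of what centering and scaling actually does:

# Sketch: centering and scaling a Uniform(0, 1) variable shifts it to mean 0 and
# stretches it by 1/sd, so the range widens from [0, 1] to roughly [-1.73, 1.73];
# the relative spacing of the values is unchanged.
u = runif(10000)
z = (u - mean(u)) / sd(u)   # same transform as preProc = c("center", "scale")
round(range(u), 2)
round(range(z), 2)

Since every predictor column has nearly the same standard deviation, standardizing barely changes the relative distances between points, which is why the two kNN fits below come out so close.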

set.seed(200)
knnNonNorm <- train(x = trD$x,
                    y = trD$y,
                    method = "knn",
                    tuneLength = 10)

knnNonNorm
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.693100  0.4670964  2.984478
##    7  3.588241  0.4932495  2.900074
##    9  3.483328  0.5316156  2.821533
##   11  3.415450  0.5616735  2.761507
##   13  3.380127  0.5834403  2.738470
##   15  3.355314  0.5996582  2.716491
##   17  3.370422  0.6044661  2.732379
##   19  3.366822  0.6189193  2.736500
##   21  3.380526  0.6227753  2.749054
##   23  3.379533  0.6334530  2.754868
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
set.seed(200)
knnNonNormPred = predict(knnNonNorm, newdata = teD$x)
postResample(pred = knnNonNormPred, obs = teD$y)
##     RMSE Rsquared      MAE 
## 5.577570       NA 4.508264

The non-normalized model did slightly better on the test-set errors, although the normalized version had a marginally higher resampled \(R^2\) during training.
Both chose the same number of neighbors, k = 15.

Incidentally, it doesn’t matter if you just center the data (subtract 0.5 from all predictors); you get exactly the same model, because the Euclidean distances between observations don’t change at all. I’m sure that for some model types normalized data work better than uniform data, but I don’t think that’s the case for kNN.
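
A minimal check of that claim (a sketch using stats::dist, which computes Euclidean distances by default):

# Pairwise Euclidean distances are unchanged when a constant is subtracted per column
d_raw      = dist(trD$x)
d_centered = dist(scale(trD$x, center = TRUE, scale = FALSE))  # center only, no scaling
all.equal(as.vector(d_raw), as.vector(d_centered))             # TRUE (up to floating point)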

Neural Network

set.seed(200)
nnetGrid = expand.grid(.decay = c(0.05, 0.1, .15, .2),
                       .size = 8,
                       .bag = F)
nnetTune = train(trD$x, trD$y, method = "avNNet", tuneGrid = nnetGrid,
                 linout = T, trace = F, maxit = 500, MaxNWts = 100)
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(nnetTune, main = "Bootstrapped Training RMSE for 8 Hidden Units with Varying Decays")

Using a weight decay of 0.15:

set.seed(200)
nnetAvg = avNNet(trD$x, trD$y, size = 8, decay = 0.15, repeats = 5, linout = T,
                 trace = F, maxit = 500, MaxNWts = 100)
nnetPreds = predict(nnetAvg, newdata = teD$x)
nnetResults = data.frame(obs = teD$y, pred = nnetPreds)
defaultSummary(nnetResults)
##      RMSE  Rsquared       MAE 
## 1.5612987 0.9025872 1.2090830

These results are much better than the kNN model’s. For the record, I tried 4, 8, 12, and 16 hidden units, and 8 was best.
It may also be worth mentioning that I didn’t normalize the uniform data.
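
The hidden-unit comparison looked roughly like this sketch (the decay is fixed at 0.15 here purely for illustration, and MaxNWts is raised so the largest network fits; it re-runs the bootstrap tuning, so it is slow):

# Sketch: vary the number of hidden units; 8 gave the lowest bootstrapped RMSE
set.seed(200)
sizeGrid = expand.grid(.decay = 0.15, .size = c(4, 8, 12, 16), .bag = F)
nnetSizeTune = train(trD$x, trD$y, method = "avNNet", tuneGrid = sizeGrid,
                     linout = T, trace = F, maxit = 500,
                     MaxNWts = 16 * (ncol(trD$x) + 1) + 16 + 1)
plot(nnetSizeTune, main = "Bootstrapped RMSE by Number of Hidden Units")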

Multivariate Adaptive Regression Splines

library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
set.seed(200)
plainMars = earth(trD$x, trD$y)
summary(plainMars)
## Call: earth(x=trD$x, y=trD$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982

The informative predictors were indeed recognized by the model. In order: X1, X4, X2, X5, X3.
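
For a more compact view of that ranking, earth’s evimp() can be called on the fitted model:

# Variable importance from the earth package (nsubsets, GCV, and RSS criteria)
evimp(plainMars)

Now the test-set performance of this default fit: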

plainMarsPreds = predict(plainMars, newdata = teD$x)
plainMarsResults = data.frame(obs = teD$y, pred = plainMarsPreds[,])
defaultSummary(plainMarsResults)
##      RMSE  Rsquared       MAE 
## 1.8108045 0.8690992 1.3840843

Let’s see if tuning a couple of parameters will improve those results.

set.seed(200)
marsGrid = expand.grid(.degree = 1:2, .nprune = 5:20)
marsTuned = train(trD$x, trD$y, method = "earth", tuneGrid = marsGrid,
                  newvar.penalty = 0.5,
                  trControl = trainControl(method = "cv", number = 10))
marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        5      2.402099  0.7568218  1.9457323
##   1        6      2.289366  0.7762213  1.8514384
##   1        7      1.730346  0.8670812  1.3584142
##   1        8      1.671535  0.8753821  1.3215500
##   1        9      1.603034  0.8845449  1.2847966
##   1       10      1.620698  0.8820821  1.2989191
##   1       11      1.588064  0.8879146  1.2627053
##   1       12      1.577311  0.8890065  1.2494447
##   1       13      1.578157  0.8891578  1.2596942
##   1       14      1.579869  0.8895682  1.2546183
##   1       15      1.579869  0.8895682  1.2546183
##   1       16      1.579869  0.8895682  1.2546183
##   1       17      1.579869  0.8895682  1.2546183
##   1       18      1.579869  0.8895682  1.2546183
##   1       19      1.579869  0.8895682  1.2546183
##   1       20      1.579869  0.8895682  1.2546183
##   2        5      2.353031  0.7675326  1.8660663
##   2        6      2.215402  0.7888283  1.7377388
##   2        7      1.780715  0.8582379  1.3800123
##   2        8      1.666597  0.8813352  1.3063166
##   2        9      1.554979  0.8985663  1.2114735
##   2       10      1.426109  0.9165576  1.1509405
##   2       11      1.355926  0.9215169  1.0974748
##   2       12      1.254201  0.9329189  1.0259035
##   2       13      1.245268  0.9356153  1.0049496
##   2       14      1.232472  0.9377076  0.9963319
##   2       15      1.235843  0.9370977  1.0032840
##   2       16      1.205527  0.9403988  0.9839347
##   2       17      1.205527  0.9403988  0.9839347
##   2       18      1.205527  0.9403988  0.9839347
##   2       19      1.205527  0.9403988  0.9839347
##   2       20      1.205527  0.9403988  0.9839347
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 16 and degree = 2.
varImp(marsTuned)$importance
##      Overall
## X1 100.00000
## X4  74.84542
## X2  48.59694
## X5  15.43919
## X3   0.00000

This tuned version uses just X1-X5 and their combinations (it did pick up on the X1*X2 interaction):

summary(marsTuned)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=2,
##             nprune=16, newvar.penalty=0.5)
## 
##                                 coefficients
## (Intercept)                        22.721354
## h(0.655098-X1)                    -15.638291
## h(X1-0.655098)                      9.284136
## h(0.647925-X2)                    -17.516017
## h(X2-0.647925)                      9.858733
## h(0.447442-X3)                      9.726195
## h(X3-0.606015)                     12.604512
## h(0.734892-X4)                     -9.776458
## h(X4-0.734892)                      9.960955
## h(0.850094-X5)                     -5.366065
## h(0.218266-X1) * h(X2-0.647925)   -57.662217
## h(X1-0.218266) * h(X2-0.647925)   -31.931918
## h(X1-0.655098) * h(X2-0.321448)   -38.012887
## h(0.671787-X1) * h(0.647925-X2)    24.027065
## 
## Selected 14 of 19 terms, and 5 of 10 predictors (nprune=16)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 4
## GCV 1.595457    RSS 221.1483    GRSq 0.9352238    RSq 0.9546541

It looks like it may do slightly better than the default MARS model from before (.869).

newMarsPreds = predict(marsTuned, newdata = teD$x)
newMarsResults = data.frame(obs = teD$y, pred = newMarsPreds[,])
defaultSummary(newMarsResults)
##      RMSE  Rsquared       MAE 
## 1.1806023 0.9445875 0.9362056

That is quite a bit better than even the nnet model.

1.18 RMSE here, vs 1.81 in the default MARS and 1.56 in the nnet.
.945 \(R^2\) here vs .869 in the default MARS and .903 in the nnet.

Reviewing the help page for earth, it appears that by trying different newvar.penalty values and keeping the one that worked best (the highest one I tried), I made it preferable for the model to reuse combinations of already-chosen features rather than add new candidates. As the design of this dataset would have it, X1 and X2 are the most important predictors, and everything after X5 doesn’t matter at all. So if the model considered the features in order, that coincidence may have worked in its favor, and perhaps these results shouldn’t be taken at face value.
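
For reference, that comparison amounts to something like the following sketch (the penalty values here are illustrative; it re-runs the cross-validated tuning for each one, so it is slow):

# Sketch: re-tune the MARS model at a few newvar.penalty settings and compare test performance
# (0 is the earth default; 0.5 is the value used above)
for (nvp in c(0, 0.1, 0.5)) {
  set.seed(200)
  fit = train(trD$x, trD$y, method = "earth", tuneGrid = marsGrid,
              newvar.penalty = nvp,
              trControl = trainControl(method = "cv", number = 10))
  preds = predict(fit, newdata = teD$x)
  print(c(newvar.penalty = nvp, defaultSummary(data.frame(obs = teD$y, pred = preds[,]))))
}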

I don’t know whether the model searches left to right in the matrix, or by alphabetical order, so I’ll try changing both to see if the results were due to lucky ordering of features considered. First the left-to-right ordering:

set.seed(200)
neworder = sample(1:10)
shufTrain = trD$x[,neworder]
shufTest = teD$x[,neworder]
shufTuned = train(shufTrain, trD$y, method = "earth", tuneGrid = marsGrid,
                  newvar.penalty = 0.5,
                  trControl = trainControl(method = "cv", number = 10))
varImp(shufTuned)$importance
##      Overall
## X1 100.00000
## X4  74.84542
## X2  48.59694
## X5  15.43919
## X3   0.00000
head(shufTrain, 2)
##          X6        X2        X7        X8        X4       X10        X1
## 1 0.3617906 0.6478064 0.8266609 0.4214081 0.1815996 0.5886216 0.5337724
## 2 0.4530593 0.4381528 0.6489601 0.8446239 0.6692491 0.7584008 0.5837650
##          X9        X3        X5
## 1 0.5911144 0.8507853 0.9290398
## 2 0.9281931 0.6727266 0.1637978

Well, changing the left-right order of the columns didn’t affect anything, but maybe the model is going by alphabetical order, so I’ll try renaming features in reverse.

renameTrain = trD$x
renameTest = teD$x
names(renameTrain) = LETTERS[10:1]
names(renameTest) = names(renameTrain)
renameTuned = train(renameTrain, trD$y, method = "earth", tuneGrid = marsGrid,
                  newvar.penalty = 0.5,
                  trControl = trainControl(method = "cv", number = 10))
varImp(renameTuned)$importance
##     Overall
## J 100.00000
## G  74.89804
## I  48.68261
## F  15.50199
## H   0.00000

That didn’t change things either, so it looks like the effect of the high newvar.penalty setting wasn’t an accident of feature ordering.

Support Vector Machines

library(kernlab)
## Warning: package 'kernlab' was built under R version 4.0.5
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
set.seed(200)

svmRTuned = train(trD$x, trD$y, 
                 method = 'svmRadial',
                 tuneLength = 14, trControl = trainControl(method = 'cv'))

svmLTuned = train(trD$x, trD$y, 
                 method = 'svmLinear',
                 tuneLength = 14, trControl = trainControl(method = 'cv'))

polyGrid = expand.grid(.degree = 2:4, .scale = c(.05, .1, .15), .C = c(.25, .5, 1))
svmPTuned = train(trD$x, trD$y, 
                 method = 'svmPoly',
                 tuneGrid = polyGrid, trControl = trainControl(method = 'cv'))
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 8 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0629932410345397 
## 
## Number of Support Vectors : 152 
## 
## Objective Function Value : -72.63 
## Training error : 0.009178
svmLTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 172 
## 
## Objective Function Value : -54.9759 
## Training error : 0.216921
svmPTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.5 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  4  scale =  0.05  offset =  1 
## 
## Number of Support Vectors : 144 
## 
## Objective Function Value : -11.5111 
## Training error : 0.029197
radialPreds = predict(svmRTuned, newdata = teD$x)
radialResults = data.frame(obs = teD$y, pred = radialPreds)
defaultSummary(radialResults)
##      RMSE  Rsquared       MAE 
## 2.0870811 0.8253497 1.5886934
linearPreds = predict(svmLTuned, newdata = teD$x)
linearResults = data.frame(obs = teD$y, pred = linearPreds)
defaultSummary(linearResults)
##      RMSE  Rsquared       MAE 
## 2.7610923 0.6997975 2.0815518
polyPreds = predict(svmPTuned, newdata = teD$x)
polyResults = data.frame(obs = teD$y, pred = polyPreds)
defaultSummary(polyResults)
##     RMSE Rsquared      MAE 
## 2.093961 0.824972 1.574803

The RBF kernel (top) and the polynomial kernel (bottom) produced similar test-set results, while the linear kernel lagged behind.
Overall, the SVM results weren’t as strong on this dataset as MARS or the neural network, but they did beat kNN.
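
Collecting the test-set metrics from all of the models above in one table (a quick sketch; it assumes the prediction objects created earlier are still in the workspace):

# Test-set performance (5000 simulated points) for each model fit above
rbind(kNN     = postResample(knnNonNormPred, teD$y),
      avNNet  = defaultSummary(nnetResults),
      MARS    = defaultSummary(newMarsResults),
      svmRBF  = defaultSummary(radialResults),
      svmPoly = defaultSummary(polyResults),
      svmLin  = defaultSummary(linearResults))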

(b) Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

The best of the models I tried was the MARS model with a high newvar.penalty. Second best was the neural network with 8 hidden units and a fairly high weight decay.

MARS did in fact find the important “predictors”.


Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# remove one feature that has near-zero variance
prepro = ChemicalManufacturingProcess %>% select(-c(BiologicalMaterial07))
# turn the 0's in row 108 into NA's, then impute medians for all NA's in the dataset
prepro[108, c(37:39,41:43)] = NA   #MP's 25-27, 29-31
for (feat in names(prepro)) {
  med = median(prepro[,feat], na.rm = T)
  prepro[feat][is.na(prepro[feat])] = med
}
# shorten the feature names
shorten = function(n) paste0(substr(n, 1, 1), substr(n, nchar(n)-1, nchar(n)))
names(prepro) = sapply(names(prepro), shorten)

Split the 176 rows into 136 train / 40 test.

set.seed(624)
trains = sample(1:176, 136)  # randomly choose 136 and test with the other 40
tests = setdiff(1:176, trains)
trainDF = prepro[trains,]
testDF = prepro[tests,]
trainX = trainDF %>% select(-c(Yld))
trainY = trainDF$Yld
testX = testDF %>% select(-c(Yld))
testY = testDF$Yld

Min-max scale the values, using the ranges of the training set so as not to leak information from the test set.

for (feat in names(trainX)){
  hi = max(trainX[,feat])
  lo = min(trainX[,feat])
  trainX[,feat] = (trainX[,feat] - lo) / (hi - lo)
  testX[,feat] = (testX[,feat] - lo) / (hi - lo)
}
featurePlot(trainX[, 1:28], trainY)

featurePlot(trainX[, 29:56], trainY)

The positive correlations between the Bio features and the response are evident here. Some of the Mfg features also show clear correlations with the response, both positive (M32) and negative (M36). And because several features contain lone 0’s (M20, for example), these models will benefit from normalizing the features.
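
A quick look at one of those features (a sketch; M20 is the shortened name for ManufacturingProcess20, inspected on the raw scale):

# Sketch: inspect the distribution of M20 on the raw (unscaled) data
summary(prepro$M20)
boxplot(prepro$M20, main = "M20 (ManufacturingProcess20), raw scale", ylab = "M20")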

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

set.seed(624)
knnTune = train(trainX, trainY, method = "knn",
                preProc = c("center", "scale"), 
                tuneGrid = data.frame(.k = seq(1,15,2)),
                trControl = trainControl(method = "cv"))
knnTune
## k-Nearest Neighbors 
## 
## 136 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 124, 121, 121, 124, 123, 122, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE      
##    1  1.416087  0.5025799  1.1124934
##    3  1.210293  0.6047356  0.9290389
##    5  1.249059  0.5953135  1.0020743
##    7  1.277930  0.5805682  1.0546478
##    9  1.308584  0.5623801  1.0821914
##   11  1.332110  0.5503524  1.0958409
##   13  1.367348  0.5223980  1.1200447
##   15  1.369058  0.5301305  1.1275520
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
knnNormPred = predict(knnTune, newdata = testX)
knnNormResults = data.frame(obs = testY, pred = knnNormPred)
defaultSummary(knnNormResults)
##      RMSE  Rsquared       MAE 
## 1.2446672 0.4532793 1.0724167

kNN with k=3 and the features normalized produces a baseline RMSE of 1.245 and an \(R^2\) of .453. Hopefully this won’t be the optimal model, since it doesn’t get to the heart of the question about which predictors have the most effect on the response. For that, maybe MARS will help here.

set.seed(624)
marsGrid = expand.grid(.degree = 1:2, .nprune = 5:20)
marsTuned = train(trainX, trainY, method = "earth", tuneGrid = marsGrid,
                  newvar.penalty = 0.07, preProcess = c("center", "scale"),
                  trControl = trainControl(method = "cv"))
summary(marsTuned)
## Call: earth(x=data.frame[136,56], y=c(43.38,40.31,3...), keepxy=TRUE, degree=1,
##             nprune=15, newvar.penalty=0.07)
## 
##                  coefficients
## (Intercept)         39.818667
## h(1.3001-B06)       -0.535678
## h(-1.09499-M09)     -1.003246
## h(-1.36465-M13)      3.860237
## h(M22- -1.34292)    -0.205026
## h(0.76318-M28)       0.431782
## h(1.15187-M30)      -0.518481
## h(-1.16237-M32)     -2.016012
## h(M32- -1.16237)     1.606808
## h(M32-1.01059)      -1.443762
## h(-0.960163-M33)     2.156659
## h(0.218267-M39)     -0.232838
## h(M39-0.218267)     -6.938517
## h(M42-0.295014)     -6.846153
## 
## Selected 14 of 22 terms, and 10 of 56 predictors (nprune=15)
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: M32, M30, M13, M39, M09, B06, M28, M33, M22, M42, B01-unused, ...
## Number of terms at each degree of interaction: 1 13 (additive model)
## GCV 1.01665    RSS 88.81486    GRSq 0.7196    RSq 0.8172054
marsPreds = predict(marsTuned, newdata = testX)
marsResults = data.frame(obs = testY, pred = marsPreds[,])
defaultSummary(marsResults)
##      RMSE  Rsquared       MAE 
## 1.1554675 0.5355233 0.8798888

That .536 is quite a dropoff from the cross-validation \(R^2\), and not as good as a simple Linear Regression model (.596).

set.seed(624)

svmRadTuned = train(trainX, trainY, 
                 method = 'svmRadial', preProcess = c("center", "scale"),
                 tuneLength = 14, trControl = trainControl(method = 'cv'))

polyGrid = expand.grid(.degree = 2:4, .scale = c(.05, .1, .15), .C = c(.25, .5, 1))
svmPolyTuned = train(trainX, trainY,
                 method = 'svmPoly', preProcess = c("center", "scale"),
                 tuneGrid = polyGrid, trControl = trainControl(method = 'cv'))
svmRadTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 4 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0139503949307737 
## 
## Number of Support Vectors : 118 
## 
## Objective Function Value : -75.1875 
## Training error : 0.031368
radPreds = predict(svmRadTuned, newdata = testX)
radResults = data.frame(obs = testY, pred = radPreds)
defaultSummary(radResults)
##      RMSE  Rsquared       MAE 
## 0.9821407 0.6717444 0.7670203

The RBF model did quite a bit better than any of the other models so far, with a test-set \(R^2\) of .672.

svmPolyTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 0.25 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  2  scale =  0.15  offset =  1 
## 
## Number of Support Vectors : 129 
## 
## Objective Function Value : -1.2469 
## Training error : 0.009761
polPreds = predict(svmPolyTuned, newdata = testX)
polResults = data.frame(obs = testY, pred = polPreds)
defaultSummary(polResults)
##      RMSE  Rsquared       MAE 
## 5.3052920 0.0406753 1.7460609

Not nearly as good as the RBF SVM.

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The optimal model was the SVM with RBF. Here are the top-20 predictor importances:

varImp(svmRadTuned)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##     Overall
## M13  100.00
## M32   96.49
## M17   88.25
## B06   83.63
## M09   79.40
## B03   70.95
## M36   68.71
## M31   64.92
## B02   62.62
## B12   59.61
## M06   58.95
## M33   50.11
## B04   48.81
## M30   47.95
## M11   46.60
## B11   42.04
## M29   40.86
## B08   40.66
## B01   38.30
## B09   38.26

Compared to the best linear model’s feature importance, determined by coefficient magnitude:

linfeats = c('B02', 'B03', 'M04', 'M09', 'M17', 'M27', 'M29', 'M32', 'M33', 'M43')
linvals = c(-3.614, 3.647, 2.556, 2.401, -3.071, -4.984, 7.204, 8.088, -4.209, 2.063)    
featframe = data.frame(Feats = linfeats, Coef = linvals)
featframe[order(-abs(featframe$Coef)),]
##    Feats   Coef
## 8    M32  8.088
## 7    M29  7.204
## 6    M27 -4.984
## 9    M33 -4.209
## 2    B03  3.647
## 1    B02 -3.614
## 5    M17 -3.071
## 3    M04  2.556
## 4    M09  2.401
## 10   M43  2.063

Interestingly, the Bio features are more important in the SVM model than they were in the linear models. The Bio features were for the most part correlated with the response, so it makes sense that they matter to these models; but several Manufacturing features were highly correlated with some of the Bio features, which left the door open for a model to drop one member of each correlated pair.

M13 was the most important predictor for the SVM but didn’t register as strongly for the linear model. It was highly correlated with M09 and M17, both of which were top ten in both models. M09, M32, B02, and B03 were top ten for both as well. The SVM kept B06 and B12, whereas the linear model kept M27 and ranked M29 higher. That M29 is one of the features the linear model used to capture some of the Bio features it discarded, as it is fairly correlated with the B06 and B12 that the SVM leaned on.
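
Those correlation claims can be spot-checked directly (a sketch using the shortened feature names):

# Pairwise correlations among the features discussed above
round(cor(prepro[, c("M13", "M09", "M17", "M27", "M29", "M32", "B02", "B03", "B06", "B12")]), 2)

The full correlation structure: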

library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(prepro), type = "upper", diag = F) -> r  # assign to suppress the printed output

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

As mentioned, the Bio features uniquely used by the SVM model are positively correlated with the response, especially B06. B06, in turn, is highly correlated with features the linear model relied on most, like M29 and M33. It’s hard to know the most effective way to use this knowledge to increase Yield, since we don’t know what the features describe beyond their broad category. It was mentioned that Bio features can’t be altered, whereas Manufacturing ones can, so it might seem preferable to have a top-ten list like the linear model’s, where more of the features are manufacturing ones. But if the experimenters can simply choose to use materials with the Bio features that work best, the SVM model shows a clearer path to improving Yield, given its higher predictive power. One big question is why so many Bio features are so strongly correlated with Mfg features. A few more experimental runs that break those correlations might determine whether to focus on improving Yield through selection of Bio features or through adjustments to the Mfg features.
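
For a visual check, here is a quick sketch plotting the predictors discussed above against the response (training data, using the min-max-scaled predictors from earlier):

# Scatterplots of the Bio features unique to the SVM top ten (B06, B12) and the
# correlated process features from the linear model (M29, M33) against Yield
featurePlot(trainX[, c("B06", "B12", "M29", "M33")], trainY)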