DATA 624: Assignment 08

Author

Curtis Elsasser

Published

April 13, 2025

Setup

library(AppliedPredictiveModeling)
library(caret)
library(GGally)
library(mlbench)
library(tibble)
library(tidyverse)

set.seed(314159)

Assignment

Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.

Problem 7.2

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]

where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates five other, non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)

## Look at the data using caret's featurePlot
featurePlot(trainingData$x, trainingData$y)

## Let's see what it looks like in a line plot.
## It should be sine-like.
data.frame(index = 1:200, trainingData$x, y = trainingData$y) |>
  pivot_longer(cols = -index,
    names_to = "predictor",
    values_to = "value") |>
  ggplot(mapping = aes(x = index, y = value, color = predictor)) +
  geom_line() +
  labs(title = "Friedman1 Data: Predictors and y") +
  facet_wrap(~ predictor, ncol = 2, scales = "free")

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

One can sort of see its sine origins. But it also looks noisy. Let’s see what our non-linear models make of it.
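As a sanity check (our own sketch, not part of the assignment): plotting y against the product \(x_1 x_2\) should trace the \(10\sin(\pi x_1 x_2)\) term, blurred by the remaining terms and the noise.

data.frame(x1x2 = trainingData$x$X1 * trainingData$x$X2, y = trainingData$y) |>
  ggplot(mapping = aes(x = x1x2, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE) +
  labs(title = "y vs. x1 * x2", x = "x1 * x2")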

Tune several models on these data.

## Helper: predict on the held-out test set and report performance.
makePrediction <- function(model) {
  prediction <- predict(model, newdata = testData$x)
  ## The function 'postResample' can be used to get the test set performance values
  estimates <- postResample(pred = prediction, obs = testData$y)
  ## Returns the predictions with the single test-set RMSE recycled alongside
  return(data.frame(
    prediction = prediction,
    RMSE = estimates[1]
  ))
}

KNN

knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)
knnModel
k-Nearest Neighbors 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  3.466085  0.5121775  2.816838
   7  3.349428  0.5452823  2.727410
   9  3.264276  0.5785990  2.660026
  11  3.214216  0.6024244  2.603767
  13  3.196510  0.6176570  2.591935
  15  3.184173  0.6305506  2.577482
  17  3.183130  0.6425367  2.567787
  19  3.198752  0.6483184  2.592683
  21  3.188993  0.6611428  2.588787
  23  3.200458  0.6638353  2.604529

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.

Our KNN champion is:

Method RMSE Params
knn 3.183130 k=17
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
     RMSE  Rsquared       MAE 
3.2040595 0.6819919 2.5683461 
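The makePrediction() helper defined earlier wraps this same predict-then-postResample pattern; a usage sketch (it returns the predictions with the single test-set RMSE recycled alongside):

knnTest <- makePrediction(knnModel)
knnTest$RMSE[1]  # same test-set RMSE as the postResample call above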

SVR

trainSVR <- function(method, tuneLength = 5) {
  # we could pass our own control parameters, e.g.
  # ctrl <- trainControl(method = "cv", number = 5)
  # without one, train() defaults to 25 bootstrap resamples (see output below)
  fit <- train(
    x = trainingData$x,
    y = trainingData$y,
    method = method,
    preProcess = c("center", "scale"),
    tuneLength = tuneLength
  )
  print(fit)
  print(fit$bestTune)
  return(fit)
}

svmLModel <- trainSVR("svmLinear", 1)
Support Vector Machines with Linear Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results:

  RMSE      Rsquared   MAE    
  2.551519  0.7476728  2.01746

Tuning parameter 'C' was held constant at a value of 1
  C
1 1
svmRModel <- trainSVR("svmRadial", 10)
Support Vector Machines with Radial Basis Function Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  C       RMSE      Rsquared   MAE     
    0.25  2.525979  0.7804630  2.016014
    0.50  2.293423  0.7960080  1.808878
    1.00  2.156969  0.8112034  1.697751
    2.00  2.081487  0.8226984  1.631759
    4.00  2.050866  0.8270470  1.605581
    8.00  2.046703  0.8280421  1.602148
   16.00  2.046385  0.8281081  1.601593
   32.00  2.046385  0.8281081  1.601593
   64.00  2.046385  0.8281081  1.601593
  128.00  2.046385  0.8281081  1.601593

Tuning parameter 'sigma' was held constant at a value of 0.06529705
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06529705 and C = 16.
       sigma  C
7 0.06529705 16
svmPModel <- trainSVR("svmPoly", 4)
Support Vector Machines with Polynomial Kernel 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  degree  scale  C     RMSE      Rsquared   MAE     
  1       0.001  0.25  4.740884  0.6902993  3.893787
  1       0.001  0.50  4.574385  0.6934968  3.754844
  1       0.001  1.00  4.270128  0.7003019  3.493989
  1       0.001  2.00  3.772729  0.7164046  3.080855
  1       0.010  0.25  3.585626  0.7180459  2.934032
  1       0.010  0.50  2.975363  0.7250589  2.440642
  1       0.010  1.00  2.662080  0.7302735  2.155593
  1       0.010  2.00  2.588294  0.7323073  2.067295
  1       0.100  0.25  2.576129  0.7336613  2.055606
  1       0.100  0.50  2.570671  0.7355963  2.037586
  1       0.100  1.00  2.584292  0.7351146  2.048364
  1       0.100  2.00  2.595063  0.7351536  2.058574
  1       1.000  0.25  2.596904  0.7350848  2.060741
  1       1.000  0.50  2.598265  0.7355652  2.063759
  1       1.000  1.00  2.598357  0.7357710  2.063551
  1       1.000  2.00  2.599307  0.7359000  2.064332
  2       0.001  0.25  4.574377  0.6935558  3.754853
  2       0.001  0.50  4.270092  0.7003607  3.493978
  2       0.001  1.00  3.772638  0.7164730  3.080781
  2       0.001  2.00  3.162075  0.7217320  2.601215
  2       0.010  0.25  2.965569  0.7275450  2.431434
  2       0.010  0.50  2.646295  0.7339781  2.140232
  2       0.010  1.00  2.549960  0.7403652  2.034452
  2       0.010  2.00  2.508628  0.7465640  1.986923
  2       0.100  0.25  2.270246  0.7906128  1.740774
  2       0.100  0.50  2.214283  0.8036707  1.680506
  2       0.100  1.00  2.254140  0.8024618  1.710590
  2       0.100  2.00  2.312914  0.7982985  1.768183
  2       1.000  0.25  2.448406  0.7825888  1.902525
  2       1.000  0.50  2.469659  0.7812319  1.918542
  2       1.000  1.00  2.486838  0.7792002  1.932304
  2       1.000  2.00  2.489980  0.7788210  1.935774
  3       0.001  0.25  4.418001  0.6973432  3.621611
  3       0.001  0.50  4.004192  0.7115996  3.268737
  3       0.001  1.00  3.423407  0.7188881  2.809099
  3       0.001  2.00  2.850229  0.7278810  2.324427
  3       0.010  0.25  2.727309  0.7345626  2.214836
  3       0.010  0.50  2.545523  0.7442909  2.038521
  3       0.010  1.00  2.475986  0.7525542  1.959426
  3       0.010  2.00  2.424417  0.7631450  1.895615
  3       0.100  0.25  2.155248  0.8114591  1.653112
  3       0.100  0.50  2.192186  0.8085547  1.677435
  3       0.100  1.00  2.243233  0.8038043  1.717120
  3       0.100  2.00  2.270305  0.8016215  1.744910
  3       1.000  0.25  3.186492  0.6348075  2.455965
  3       1.000  0.50  3.186492  0.6348075  2.455965
  3       1.000  1.00  3.186492  0.6348075  2.455965
  3       1.000  2.00  3.186492  0.6348075  2.455965

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
   degree scale    C
41      3   0.1 0.25

So, what do we have? We’ve got three different results (linear, radial and polynomial) for SVR. Let’s see who our champion was:

Method RMSE Params
svmLinear 2.551519 C=1
svmRadial 2.046385 C=16, sigma=0.06529705
svmPoly 2.155248 C=0.25, degree=3, scale=0.1

Of the three, the radial model has the lowest RMSE.
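Out of curiosity (a sketch; output not shown here), the radial model's test-set RMSE via the makePrediction() helper:

svmRTest <- makePrediction(svmRModel)
svmRTest$RMSE[1]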

MARS

marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  preProc = c("center", "scale"),
  tuneLength = 10
)
Loading required package: earth
Loading required package: Formula
Loading required package: plotmo
Loading required package: plotrix
print(marsModel)
Multivariate Adaptive Regression Spline 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  nprune  RMSE      Rsquared   MAE     
   2      4.453623  0.2198309  3.682371
   3      3.712197  0.4554337  3.012165
   4      2.795658  0.6874873  2.240726
   6      2.331132  0.7817116  1.855013
   7      1.902458  0.8541971  1.506668
   9      1.753020  0.8782210  1.367227
  10      1.754192  0.8785378  1.359698
  12      1.759946  0.8770912  1.364519
  13      1.765846  0.8764817  1.371836
  15      1.803247  0.8716558  1.393796

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 9 and degree = 1.
print(marsModel$bestTune)
  nprune degree
6      9      1
Method RMSE Params
earth 1.753020 nprune=9, degree=1

Neural Network

nnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProc = c("center", "scale"),
  trace = FALSE,
  tuneLength = 7
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
print(nnModel)
Neural Network 

200 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
Resampling results across tuning parameters:

  size  decay         RMSE      Rsquared   MAE     
   1    0.0000000000  14.33889        NaN  13.48270
   1    0.0001000000  14.33889  0.4876696  13.48270
   1    0.0003981072  14.33889  0.5207444  13.48270
   1    0.0015848932  14.33889  0.1758176  13.48271
   1    0.0063095734  14.33890  0.5208661  13.48272
   1    0.0251188643  14.33894  0.7160343  13.48275
   1    0.1000000000  14.33906  0.7340019  13.48289
   3    0.0000000000  14.33889        NaN  13.48270
   3    0.0001000000  14.33889  0.2363548  13.48270
   3    0.0003981072  14.33889  0.4152135  13.48270
   3    0.0015848932  14.33889  0.1537913  13.48270
   3    0.0063095734  14.33890  0.6250552  13.48271
   3    0.0251188643  14.33892  0.7306075  13.48274
   3    0.1000000000  14.33900  0.7343296  13.48283
   5    0.0000000000  14.33889        NaN  13.48270
   5    0.0001000000  14.33889  0.2299694  13.48270
   5    0.0003981072  14.33889  0.3656188  13.48270
   5    0.0015848932  14.33889  0.1526061  13.48270
   5    0.0063095734  14.33890  0.5888706  13.48271
   5    0.0251188643  14.33891  0.7280735  13.48273
   5    0.1000000000  14.33898  0.7350663  13.48280
   7    0.0000000000  14.33889        NaN  13.48270
   7    0.0001000000  14.33889  0.2439086  13.48270
   7    0.0003981072  14.33889  0.3016703  13.48270
   7    0.0015848932  14.33889  0.1726897  13.48270
   7    0.0063095734  14.33890  0.5889107  13.48271
   7    0.0251188643  14.33891  0.7248153  13.48273
   7    0.1000000000  14.33897  0.7346983  13.48279
   9    0.0000000000  14.33889        NaN  13.48270
   9    0.0001000000  14.33889  0.2341736  13.48270
   9    0.0003981072  14.33889  0.1673498  13.48270
   9    0.0015848932  14.33889  0.1135118  13.48270
   9    0.0063095734  14.33889  0.5660738  13.48271
   9    0.0251188643  14.33891  0.7199275  13.48272
   9    0.1000000000  14.33896  0.7347103  13.48278
  11    0.0000000000  14.33889        NaN  13.48270
  11    0.0001000000  14.33889  0.1908908  13.48270
  11    0.0003981072  14.33889  0.1669535  13.48270
  11    0.0015848932  14.33889  0.1335503  13.48270
  11    0.0063095734  14.33889  0.5409709  13.48271
  11    0.0251188643  14.33891  0.7146450  13.48272
  11    0.1000000000  14.33895  0.7350194  13.48277
  13    0.0000000000  14.33889        NaN  13.48270
  13    0.0001000000  14.33889  0.1539080  13.48270
  13    0.0003981072  14.33889  0.1471895  13.48270
  13    0.0015848932  14.33889  0.1259564  13.48270
  13    0.0063095734  14.33889  0.6228874  13.48271
  13    0.0251188643  14.33891  0.7124549  13.48272
  13    0.1000000000  14.33895  0.7352840  13.48277

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 1 and decay = 0.
print(nnModel$bestTune)
  size decay
1    1     0

That is surprising. For no good reason, I thought a neural network would be competitive, but it’s not.

Method RMSE Params
nnet 14.33889 size=1, decay=0
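A likely culprit, worth noting: caret's "nnet" passes extra arguments straight through to nnet, whose default logistic output unit bounds predictions to (0, 1), while y here averages around 14 — hence every configuration collapsing to nearly the same RMSE. A sketch of a fix (linout = TRUE requests nnet's linear output unit; the raised maxit is our assumption, not something the assignment calls for):

nnLinModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProc = c("center", "scale"),
  linout = TRUE,   # linear output unit for regression
  trace = FALSE,
  maxit = 500,     # assumption: more iterations than nnet's default 100
  tuneLength = 7
)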

Answers

Which models appear to give the best performance?

Putting them all together:

Model Method RMSE Params
KNN knn 3.183130 k=17
SVR svmLinear 2.551519 C=1
SVR svmRadial 2.046385 C=16, sigma=0.06529705
SVR svmPoly 2.155248 C=0.25, degree=3, scale=0.1
MARS earth 1.753020 nprune=9, degree=1
NN nnet 14.33889 size=1, decay=0

The best performance appears to come from the MARS model, with an RMSE of 1.753020.

Does MARS select the informative predictors (those named \(X1–X5\))?

varImp(marsModel)
earth variable importance

   Overall
X1  100.00
X4   82.92
X2   64.47
X5   40.67
X3   28.65
X6    0.00

MARS most certainly did. Exclusively, in fact: only the informative predictors X1–X5 receive nonzero importance (X3 least among them), while the noise variables, represented by X6 at 0.00, go unused.
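For a closer look (a sketch, output not shown here), the underlying earth fit would list the selected hinge terms, which should involve only X1 through X5:

summary(marsModel$finalModel)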

Cleanup

Let’s get rid of some of the clutter.

rm(trainingData, testData, knnModel, svmLModel, svmRModel, svmPModel, marsModel, nnModel)

Problem 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

data("ChemicalManufacturingProcess")

# Impute missing values with k-nearest neighbors (knnImpute also centers and scales every column)
pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)

# get rid of near zero variance predictors
nz <- nearZeroVar(df, saveMetrics = FALSE)
df <- df[, -nz]

# Split the data into training and test sets
x <- df[, -1]
y <- df$Yield
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[train_index, ]
train_y <- y[train_index]
test_x <- x[-train_index, ]
test_y <- y[-train_index]

# Cleanup
rm(pp, df, x, y, train_index)
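One consequence worth flagging: because knnImpute forces centering and scaling of every column, Yield is standardized too, so the RMSEs below are in standard-deviation units of Yield rather than raw percent yield. A quick sanity check (sketch):

# train_y should be roughly mean 0, sd 1 after the preProcess step
c(mean = mean(train_y), sd = sd(train_y))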

Now let us set up resampling. We will use 5-fold cross-validation to evaluate the models.

ctrl <- trainControl(
  method = "cv",
  number = 5,
  returnResamp = "all"
)

Fitting

KNN

knnModel <- train(
  x = train_x,
  y = train_y,
  method = "knn",
  preProc = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 10
)
print(knnModel)
k-Nearest Neighbors 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 99, 99, 99, 100, 99 
Resampling results across tuning parameters:

  k   RMSE       Rsquared   MAE      
   5  0.6895884  0.5118210  0.5630125
   7  0.7213101  0.4668344  0.5880421
   9  0.7148948  0.4940850  0.5865825
  11  0.7101750  0.5071768  0.5774092
  13  0.7106957  0.5150371  0.5768799
  15  0.7159053  0.5206550  0.5750976
  17  0.7243754  0.5232246  0.5889910
  19  0.7374226  0.5155476  0.5946109
  21  0.7406342  0.5199245  0.5947447
  23  0.7512223  0.5097597  0.6024067

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
print(knnModel$bestTune)
  k
1 5
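Since ctrl asked for returnResamp = "all", we can peek at the per-fold metrics of any fitted model (sketch):

head(knnModel$resample)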

SVR

trainSVR <- function(method, tuneLength = 5) {
  # unlike in 7.2, this time we pass our own 5-fold CV control (ctrl)
  # rather than relying on caret's bootstrap default
  fit <- train(
    x = train_x,
    y = train_y,
    method = method,
    preProcess = c("center", "scale"),
    trControl = ctrl,
    tuneLength = tuneLength
  )
  print(fit)
  print(fit$bestTune)
  return(fit)
}

svmLModel <- trainSVR("svmLinear", 1)
Support Vector Machines with Linear Kernel 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 97, 100, 100, 100, 99 
Resampling results:

  RMSE      Rsquared   MAE     
  2.853927  0.3292755  1.127083

Tuning parameter 'C' was held constant at a value of 1
  C
1 1
svmRModel <- trainSVR("svmRadial", 10)
Support Vector Machines with Radial Basis Function Kernel 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 99, 99, 99, 100, 99 
Resampling results across tuning parameters:

  C       RMSE       Rsquared   MAE      
    0.25  0.7738469  0.4669364  0.6315921
    0.50  0.7208345  0.4859653  0.5831042
    1.00  0.6890652  0.5079438  0.5547469
    2.00  0.6635964  0.5312038  0.5399453
    4.00  0.6443627  0.5543532  0.5316313
    8.00  0.6442760  0.5545981  0.5352342
   16.00  0.6459216  0.5530072  0.5382635
   32.00  0.6459216  0.5530072  0.5382635
   64.00  0.6459216  0.5530072  0.5382635
  128.00  0.6459216  0.5530072  0.5382635

Tuning parameter 'sigma' was held constant at a value of 0.01236299
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01236299 and C = 8.
       sigma C
6 0.01236299 8
svmPModel <- trainSVR("svmPoly", 4)
Support Vector Machines with Polynomial Kernel 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 98, 99, 100, 100, 99 
Resampling results across tuning parameters:

  degree  scale  C     RMSE         Rsquared   MAE       
  1       0.001  0.25    0.9106419  0.3849392   0.7281315
  1       0.001  0.50    0.8616691  0.3975147   0.6980332
  1       0.001  1.00    0.8082105  0.4136057   0.6662630
  1       0.001  2.00    0.7611620  0.4199080   0.6313126
  1       0.010  0.25    0.7643470  0.4090222   0.6259356
  1       0.010  0.50    0.7867462  0.4144638   0.6072599
  1       0.010  1.00    0.9152588  0.4023962   0.6389675
  1       0.010  2.00    1.1851424  0.3833684   0.7008715
  1       0.100  0.25    1.3083413  0.3745627   0.7303115
  1       0.100  0.50    1.6255470  0.3617986   0.7934839
  1       0.100  1.00    2.1657717  0.3261827   0.9086161
  1       0.100  2.00    2.6400839  0.2696350   1.0311687
  1       1.000  0.25    2.6536636  0.2623733   1.0482108
  1       1.000  0.50    2.8187630  0.2456215   1.1061679
  1       1.000  1.00    3.0821454  0.2390803   1.1683853
  1       1.000  2.00    3.5834123  0.2470449   1.2751313
  2       0.001  0.25    0.8612681  0.4069234   0.6967042
  2       0.001  0.50    0.8071078  0.4165077   0.6658916
  2       0.001  1.00    0.7570117  0.4367414   0.6246751
  2       0.001  2.00    0.7587181  0.4434109   0.5985194
  2       0.010  0.25    1.0514732  0.3453416   0.6570466
  2       0.010  0.50    1.1891133  0.3540641   0.6707408
  2       0.010  1.00    1.7145890  0.2902642   0.7891574
  2       0.010  2.00    1.9345630  0.2801307   0.8330883
  2       0.100  0.25    4.9842790  0.2587619   1.4734784
  2       0.100  0.50    5.5729776  0.2540084   1.5928851
  2       0.100  1.00    5.5729776  0.2540084   1.5928851
  2       0.100  2.00    5.5729776  0.2540084   1.5928851
  2       1.000  0.25    6.6374691  0.1881592   1.8621289
  2       1.000  0.50    6.6374691  0.1881592   1.8621289
  2       1.000  1.00    6.6374691  0.1881592   1.8621289
  2       1.000  2.00    6.6374691  0.1881592   1.8621289
  3       0.001  0.25    0.8354211  0.3633199   0.6879510
  3       0.001  0.50    0.7714611  0.4341537   0.6400390
  3       0.001  1.00    0.7455086  0.4518126   0.6028351
  3       0.001  2.00    0.8309080  0.3847034   0.6142516
  3       0.010  0.25   15.4474754  0.2870653   3.5569843
  3       0.010  0.50   21.0469213  0.2757707   4.6848753
  3       0.010  1.00   24.2540434  0.2586997   5.3343507
  3       0.010  2.00   31.8667519  0.2595499   6.8544389
  3       0.100  0.25  119.8764181  0.2484157  24.7404900
  3       0.100  0.50  119.8764181  0.2484157  24.7404900
  3       0.100  1.00  119.8764181  0.2484157  24.7404900
  3       0.100  2.00  119.8764181  0.2484157  24.7404900
  3       1.000  0.25   49.9475146  0.2073258  10.8534458
  3       1.000  0.50   49.9475146  0.2073258  10.8534458
  3       1.000  1.00   49.9475146  0.2073258  10.8534458
  3       1.000  2.00   49.9475146  0.2073258  10.8534458

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were degree = 3, scale = 0.001 and C = 1.
   degree scale C
35      3 0.001 1

MARS

marsModel <- train(
  x = train_x,
  y = train_y,
  method = "earth",
  preProc = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 10
)
print(marsModel)
Multivariate Adaptive Regression Spline 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 100, 99, 98, 99, 100 
Resampling results across tuning parameters:

  nprune  RMSE       Rsquared   MAE      
   2      0.7271135  0.4591539  0.5780916
   3      0.7182703  0.4896215  0.5709093
   5      0.6681445  0.5419800  0.5213217
   7      2.9453922  0.3857181  1.0154052
   8      3.1478889  0.4316809  1.0283935
  10      3.1797305  0.3887363  1.0624880
  12      3.1466059  0.4083874  1.0460210
  13      3.1616509  0.3970114  1.0587304
  15      3.1609251  0.3997024  1.0593647
  17      3.1609251  0.3997024  1.0593647

Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 5 and degree = 1.
print(marsModel$bestTune)
  nprune degree
3      5      1

Neural Network

nnModel <- train(
  x = train_x,
  y = train_y,
  method = "nnet",
  preProc = c("center", "scale"),
  trace = FALSE,
  tuneLength = 7,
  trControl = ctrl
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
print(nnModel)
Neural Network 

124 samples
 56 predictor

Pre-processing: centered (56), scaled (56) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 99, 99, 100, 99, 99 
Resampling results across tuning parameters:

  size  decay         RMSE       Rsquared   MAE      
   1    0.0000000000  0.9682669        NaN  0.7873824
   1    0.0001000000  0.9150490  0.3653951  0.7207956
   1    0.0003981072  0.8755967  0.3371277  0.6952522
   1    0.0015848932  0.8334053  0.4289406  0.6628271
   1    0.0063095734  0.8114338  0.4595784  0.6414999
   1    0.0251188643  0.8362465  0.4254025  0.6710512
   1    0.1000000000  0.8301152  0.4469966  0.6622558
   3    0.0000000000  0.9682424  0.3804808  0.7873682
   3    0.0001000000  0.8443653  0.4364528  0.6761996
   3    0.0003981072  0.7899455  0.4966880  0.6254486
   3    0.0015848932  0.8517819  0.3514565  0.6790311
   3    0.0063095734  0.8252297  0.4245470  0.6520447
   3    0.0251188643  0.8232849  0.4635433  0.6581931
   3    0.1000000000  0.8244168  0.4502662  0.6552868
   5    0.0000000000  0.9682669        NaN  0.7873824
   5    0.0001000000  0.8378285  0.4562766  0.6724116
   5    0.0003981072  0.8086785  0.4689954  0.6450542
   5    0.0015848932  0.8299925  0.4234782  0.6608541
   5    0.0063095734  0.8222530  0.4407092  0.6544801
   5    0.0251188643  0.8181780  0.4663482  0.6572138
   5    0.1000000000  0.8199729  0.4571137  0.6518002
   7    0.0000000000  0.9682668  0.1120367  0.7873824
   7    0.0001000000  0.8505934  0.4688188  0.6708782
   7    0.0003981072  0.8244911  0.4023881  0.6477988
   7    0.0015848932  0.8198216  0.4436599  0.6563836
   7    0.0063095734  0.8133107  0.4686115  0.6576149
   7    0.0251188643  0.8095148  0.4972942  0.6428775
   7    0.1000000000  0.8221814  0.4519543  0.6525025
   9    0.0000000000  0.9682669        NaN  0.7873824
   9    0.0001000000  0.8336860  0.3973938  0.6599252
   9    0.0003981072  0.8026328  0.4572917  0.6366140
   9    0.0015848932  0.8170857  0.4452725  0.6481996
   9    0.0063095734  0.8176285  0.4530313  0.6521788
   9    0.0251188643  0.8124691  0.4781019  0.6476017
   9    0.1000000000  0.8225707  0.4466223  0.6518940
  11    0.0000000000  0.9682669        NaN  0.7873824
  11    0.0001000000  0.8066660  0.4484055  0.6355455
  11    0.0003981072  0.8365208  0.3695474  0.6724226
  11    0.0015848932  0.8226359  0.4259208  0.6531413
  11    0.0063095734  0.8183817  0.4594450  0.6534696
  11    0.0251188643  0.8097759  0.4915198  0.6540279
  11    0.1000000000  0.8231418  0.4430660  0.6524492
  13    0.0000000000  0.9682669        NaN  0.7873824
  13    0.0001000000  0.8218718  0.4043562  0.6512450
  13    0.0003981072  0.8286873  0.4045652  0.6417256
  13    0.0015848932  0.8199272  0.4439009  0.6500867
  13    0.0063095734  0.8269164  0.4184251  0.6533317
  13    0.0251188643  0.8245829  0.4269047  0.6526659
  13    0.1000000000  0.8217071  0.4485951  0.6529192

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 3 and decay = 0.0003981072.
print(nnModel$bestTune)
   size        decay
10    3 0.0003981072

7.5.a

Which nonlinear regression model gives the optimal resampling and test set performance?

Let's try to automate this and see if we can pluck the optimal resampling model out of our set of models.

models <- list(
  KNN = knnModel,
  SVMLinear = svmLModel,
  SVMRadial = svmRModel,
  SVMPoly = svmPModel,
  MARS = marsModel,
  NNet = nnModel
)

results <- tibble(Model = names(models), ParamNames = NA, ParamValues = NA, RMSE = NA)

for (i in seq_along(models)) {
  model <- models[[i]]
  # caret picked each bestTune by smallest resampling RMSE, so min() matches it
  results$RMSE[i] <- min(model$results$RMSE)
  # only the first tuning parameter is recorded here; models with several
  # (e.g. the SVMs) have their full bestTune printed above
  results$ParamNames[i] <- names(model$bestTune)[1]
  results$ParamValues[i] <- as_vector(model$bestTune)[1]
}

results <- arrange(results, RMSE)
print(results)
# A tibble: 6 × 4
  Model     ParamNames ParamValues  RMSE
  <chr>     <chr>            <dbl> <dbl>
1 SVMRadial sigma           0.0124 0.644
2 MARS      nprune          5      0.668
3 KNN       k               5      0.690
4 SVMPoly   degree          3      0.746
5 NNet      size            3      0.790
6 SVMLinear C               1      2.85

Wow, it was a close race. Our champion, SVM with a radial kernel (\(\sigma = 0.0124\), \(C = 8\)), weighs in with a resampling RMSE of \(\approx 0.6443\).
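Strictly speaking, the question also asks about test-set performance; a sketch (not run here) against our held-out split:

testRMSE <- sapply(models, function(m) {
  postResample(pred = predict(m, newdata = test_x), obs = test_y)["RMSE"]
})
sort(testRMSE)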

7.5.b

Which predictors are most important in the optimal nonlinear regression model?

varImp(svmRModel)
loess r-squared variable importance

  only 20 most important variables shown (out of 56)

                       Overall
ManufacturingProcess32  100.00
BiologicalMaterial06     80.33
BiologicalMaterial02     72.54
ManufacturingProcess13   70.64
ManufacturingProcess36   60.52
BiologicalMaterial03     60.25
ManufacturingProcess31   58.55
ManufacturingProcess09   54.30
BiologicalMaterial04     53.08
BiologicalMaterial12     51.95
ManufacturingProcess17   51.34
ManufacturingProcess02   50.26
ManufacturingProcess29   49.66
BiologicalMaterial01     45.83
ManufacturingProcess33   43.46
BiologicalMaterial08     35.54
BiologicalMaterial11     34.29
ManufacturingProcess06   33.04
ManufacturingProcess04   31.17
ManufacturingProcess11   30.28

As you can see, the most important is ManufacturingProcess32, but ManufacturingProcess isn't dominating the way it did with linear regression. Of the first 10 most important predictors, 5 are BiologicalMaterial.

Do either the biological or process variables dominate the list?

Welly, welly, I jumped the gun. Let's consider the entire list this time. Of the first 20, 8 are BiologicalMaterial and 12 are ManufacturingProcess, so by raw count ManufacturingProcess still heads the list. But considering that the data contain far fewer BiologicalMaterial predictors (12) than ManufacturingProcess predictors (45), one could say that BiologicalMaterial relatively dominates the list:

\[ \frac{8}{12} \gt \frac{12}{45} \]
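A programmatic version of that count (a sketch in base R, so the row names survive):

imp <- varImp(svmRModel)$importance
top20 <- rownames(imp)[order(-imp$Overall)][1:20]
table(sub("[0-9]+$", "", top20))  # BiologicalMaterial vs ManufacturingProcess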

And if we sum their importance scores (restricting ourselves to the first 10)?

# ManufacturingProcess 32, 13, 36, 31, 09
manufacturing <- sum(100, 70.63862, 60.51993, 58.54715, 54.30407)
# BiologicalMaterial 06, 02, 03, 04, 12
biological <- sum(80.33204, 72.54240, 60.25020, 53.08210, 51.94877)
manufacturing
[1] 344.0098
biological
[1] 318.1555
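The same sums, computed from varImp directly instead of hard-coding the values (sketch):

imp <- varImp(svmRModel)$importance
imp10 <- imp[order(-imp$Overall), , drop = FALSE][1:10, , drop = FALSE]
tapply(imp10$Overall, sub("[0-9]+$", "", rownames(imp10)), sum)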

Manufacturing comes out on top. In conclusion, I wouldn't say that either side dominates, but we can say that ManufacturingProcess has a greater influence than BiologicalMaterial on the SVM radial model's ranking of predictor importance.

How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

BiologicalMaterial has come a long way. If we look back at linear regression's rankings, we see the first 20 ranked as follows:

Predictor Importance Index
ManufacturingProcess32 100.0000000 1
ManufacturingProcess04 84.2112599 2
ManufacturingProcess28 80.4631816 3
ManufacturingProcess43 64.8538418 4
ManufacturingProcess33 62.4884204 5
ManufacturingProcess12 57.8543025 6
ManufacturingProcess35 38.2649373 7
ManufacturingProcess37 38.0256614 8
BiologicalMaterial11 37.2042557 9
ManufacturingProcess10 36.5106556 10
ManufacturingProcess18 34.2042194 11
ManufacturingProcess20 33.1341354 12
ManufacturingProcess29 32.2528714 13
BiologicalMaterial02 31.0298498 14
ManufacturingProcess05 30.7444067 15
ManufacturingProcess26 30.1456411 16
ManufacturingProcess13 29.5391977 17
ManufacturingProcess25 28.3191327 18
ManufacturingProcess11 27.9038396 19
ManufacturingProcess24 27.6671361 20

Vs. non-linear regression (SVM)

varImp(svmRModel)$importance |>
  arrange(desc(Overall)) |>
  mutate(Index = row_number()) |>
  head(n = 20)
                         Overall Index
ManufacturingProcess32 100.00000     1
BiologicalMaterial06    80.33204     2
BiologicalMaterial02    72.54240     3
ManufacturingProcess13  70.63862     4
ManufacturingProcess36  60.51993     5
BiologicalMaterial03    60.25020     6
ManufacturingProcess31  58.54715     7
ManufacturingProcess09  54.30407     8
BiologicalMaterial04    53.08210     9
BiologicalMaterial12    51.94877    10
ManufacturingProcess17  51.33595    11
ManufacturingProcess02  50.26472    12
ManufacturingProcess29  49.66376    13
BiologicalMaterial01    45.83156    14
ManufacturingProcess33  43.46117    15
BiologicalMaterial08    35.53841    16
BiologicalMaterial11    34.28960    17
ManufacturingProcess06  33.03744    18
ManufacturingProcess04  31.16695    19
ManufacturingProcess11  30.27655    20

We see a much greater contribution from BiologicalMaterial in nonlinear regression than we did with linear regression: the linear model's top 20 contains only two BiologicalMaterial predictors (and its top ten just one, BiologicalMaterial11), whereas half of the SVM's top ten are BiologicalMaterial.

7.5.c

Explore relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Let's isolate the top 10 predictors.

df <- varImp(svmRModel)$importance |>
  arrange(desc(Overall)) |>
  slice_head(n = 10)

df <- ChemicalManufacturingProcess[rownames(df)]
# Let's shorten the predictor names. They are obscuring data in our correlation plot.
names(df) <- c("MP32", "BM06", "BM02", "MP13", "MP36", "BM03", "MP31", "MP09", "BM04", "BM12")

# Add yield as `Y`
df$Y <- ChemicalManufacturingProcess$Yield

Now let’s have a look at our data using GGally. We’ll use a heat-map to find the correlation between all pairs in our set of variables.

GGally::ggcorr(df, label = TRUE)

Interesting! The predictor with the highest correlation with Yield is also ranked the most important (MP32). And the correlations of the second (BM06), third (BM02), fourth (MP13), and fifth (MP36) with Yield all round, in absolute value, to \(0.5\). This is a good sign.

Also, there is a lot of correlation between some of our predictors. Let’s identify the top 5.

cor(df$BM02, df$BM06)
[1] 0.9543113
cor(df$BM02, df$BM03)
[1] 0.8607901
cor(df$BM06, df$BM03)
[1] 0.8723637
cor(df$BM06, df$BM12)
[1] 0.812854
cor(df$BM02, df$BM12)
[1] 0.7793419
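Rather than eyeballing pairs, caret's findCorrelation() can flag candidates for removal (a sketch; the 0.75 cutoff is our choice, and pairwise.complete.obs is a safe default should any values be missing):

highCor <- findCorrelation(
  cor(subset(df, select = -Y), use = "pairwise.complete.obs"),
  cutoff = 0.75,
  names = TRUE
)
highCor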

We very likely could have used PCA to reduce the dimensionality of our data. But we are doing the pre-processing during post-processing.

Before we wrap up, I'd like to look at some of these correlations. Plotting them all in one panel was too crowded, so we facet the data, and we min-max normalize each variable to make the relationships easier to see.

df |>
  mutate(Index = row_number()) |>
  # min-max normalize every column, Yield included
  mutate(across(-Index, ~ (.x - min(.x)) / (max(.x) - min(.x)))) |>
  pivot_longer(cols = -Index, names_to = "Variable", values_to = "Value") |>
  ggplot(mapping = aes(x = Index, y = Value, color = Variable)) +
  geom_line() +
  labs(title = "Top 10 Predictors and Yield") +
  facet_wrap(~ Variable, ncol = 2)
Warning: Removed 352 rows containing missing values or values outside the scale range
(`geom_line()`).

Where are those missing values coming from!? Oh geez, I pulled the top 10 predictors from ChemicalManufacturingProcess, the raw data, instead of from the imputed set. This was a mistake, but at the same time I think it's a little fortuitous: our correlation calculations, normalization, and plotting are based on observed data.

Let’s count our missing values in df.

df |>
  summarise(across(everything(), ~ sum(is.na(.))))
  MP32 BM06 BM02 MP13 MP36 BM03 MP31 MP09 BM04 BM12 Y
1    0    0    0    0    5    0    5    0    0    0 0

And that is what our plot is showing us, though the warning's count needs one more step of explanation: min() and max() return NA when a column contains any NA, so the five missing values apiece in MP36 and MP31 turn those two normalized columns entirely into NAs, and ggplot drops all \(2 \times 176 = 352\) of those rows. We can live with that. Especially since we need to turn it in :).