7.2 and 7.5

7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$

where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other, non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

library(mlbench)
library(caret)
set.seed(200)

trainingData <- mlbench.friedman1(200, sd = 1)

## We convert the 'x' data from a matrix to a data frame.
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)

## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
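
As a rough illustration of the equation above, the same kind of data can be simulated by hand in base R. This is only a sketch that follows the stated formula, not the mlbench implementation. The names simulate_friedman1 and noise_sd are hypothetical, and the random draws will not match those from mlbench.friedman1.

```r
# Hypothetical helper: simulate data following the Friedman equation above.
simulate_friedman1 <- function(n, noise_sd = 1) {
  # Ten predictors, each uniform on [0, 1]; only the first five enter y.
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5] +
    rnorm(n, sd = noise_sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(200)
sim <- simulate_friedman1(200)
str(sim$y)
```

Columns X6 to X10 are generated but never used in y, which is what makes them non-informative.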

Tune several models on these data. For example:

knnModel <- train(x = trainingData$x, 
                    y = trainingData$y, method = "knn",
                    preProc = c("center", "scale"), 
                    tuneLength = 10)

knnModel
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)

## The function 'postResample' can be used to get the test set
## performance values

postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
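
The three values reported by postResample can be reproduced by hand. The sketch below assumes that Rsquared is the squared correlation between predictions and observations, and it uses made-up toy vectors rather than the model output above.

```r
# Toy observed and predicted values (made up for illustration).
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4.5)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
rsq  <- cor(pred, obs)^2             # squared correlation
mae  <- mean(abs(pred - obs))        # mean absolute error

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE and MAE are in the units of y, so they are only comparable between models fit to the same outcome.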

MARS

library(earth)
marsFit <- earth(trainingData$x, trainingData$y)

marsFit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
summary(marsFit)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)


set.seed(100)

# Explicitly declare the candidate models to test
marsTuned <- train(trainingData$x, trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProcess = c("center", "scale"),
                   trControl = trainControl(method = "cv"))

marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      4.327937  0.2544880  3.6004742
##   1        3      3.572450  0.4912720  2.8958113
##   1        4      2.596841  0.7183600  2.1063410
##   1        5      2.370161  0.7659777  1.9186686
##   1        6      2.276141  0.7881481  1.8100006
##   1        7      1.766728  0.8751831  1.3902146
##   1        8      1.780946  0.8723243  1.4013449
##   1        9      1.665091  0.8819775  1.3255147
##   1       10      1.663804  0.8821283  1.3276573
##   1       11      1.657738  0.8822967  1.3317299
##   1       12      1.653784  0.8827903  1.3315041
##   1       13      1.648496  0.8823663  1.3164065
##   1       14      1.639073  0.8841742  1.3128329
##   1       15      1.639073  0.8841742  1.3128329
##   1       16      1.639073  0.8841742  1.3128329
##   1       17      1.639073  0.8841742  1.3128329
##   1       18      1.639073  0.8841742  1.3128329
##   1       19      1.639073  0.8841742  1.3128329
##   1       20      1.639073  0.8841742  1.3128329
##   1       21      1.639073  0.8841742  1.3128329
##   1       22      1.639073  0.8841742  1.3128329
##   1       23      1.639073  0.8841742  1.3128329
##   1       24      1.639073  0.8841742  1.3128329
##   1       25      1.639073  0.8841742  1.3128329
##   1       26      1.639073  0.8841742  1.3128329
##   1       27      1.639073  0.8841742  1.3128329
##   1       28      1.639073  0.8841742  1.3128329
##   1       29      1.639073  0.8841742  1.3128329
##   1       30      1.639073  0.8841742  1.3128329
##   1       31      1.639073  0.8841742  1.3128329
##   1       32      1.639073  0.8841742  1.3128329
##   1       33      1.639073  0.8841742  1.3128329
##   1       34      1.639073  0.8841742  1.3128329
##   1       35      1.639073  0.8841742  1.3128329
##   1       36      1.639073  0.8841742  1.3128329
##   1       37      1.639073  0.8841742  1.3128329
##   1       38      1.639073  0.8841742  1.3128329
##   2        2      4.327937  0.2544880  3.6004742
##   2        3      3.572450  0.4912720  2.8958113
##   2        4      2.661826  0.7070510  2.1734709
##   2        5      2.404015  0.7578971  1.9753867
##   2        6      2.243927  0.7914805  1.7830717
##   2        7      1.856336  0.8605482  1.4356822
##   2        8      1.754607  0.8763186  1.3968406
##   2        9      1.653859  0.8870129  1.2813884
##   2       10      1.434159  0.9166537  1.1339203
##   2       11      1.320482  0.9289120  1.0347278
##   2       12      1.317547  0.9306879  1.0359899
##   2       13      1.296910  0.9306902  1.0146112
##   2       14      1.221407  0.9395223  0.9631486
##   2       15      1.230516  0.9390469  0.9761484
##   2       16      1.236911  0.9387407  0.9745362
##   2       17      1.236911  0.9387407  0.9745362
##   2       18      1.236911  0.9387407  0.9745362
##   2       19      1.236911  0.9387407  0.9745362
##   2       20      1.236911  0.9387407  0.9745362
##   2       21      1.236911  0.9387407  0.9745362
##   2       22      1.236911  0.9387407  0.9745362
##   2       23      1.236911  0.9387407  0.9745362
##   2       24      1.236911  0.9387407  0.9745362
##   2       25      1.236911  0.9387407  0.9745362
##   2       26      1.236911  0.9387407  0.9745362
##   2       27      1.236911  0.9387407  0.9745362
##   2       28      1.236911  0.9387407  0.9745362
##   2       29      1.236911  0.9387407  0.9745362
##   2       30      1.236911  0.9387407  0.9745362
##   2       31      1.236911  0.9387407  0.9745362
##   2       32      1.236911  0.9387407  0.9745362
##   2       33      1.236911  0.9387407  0.9745362
##   2       34      1.236911  0.9387407  0.9745362
##   2       35      1.236911  0.9387407  0.9745362
##   2       36      1.236911  0.9387407  0.9745362
##   2       37      1.236911  0.9387407  0.9745362
##   2       38      1.236911  0.9387407  0.9745362
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
head(predict(marsTuned, trainingData$x))
##             y
## [1,] 18.03158
## [2,] 15.61875
## [3,] 17.74888
## [4,] 12.26653
## [5,] 19.74822
## [6,] 20.60847
#checking the most important variables
varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00
MarsPred <- predict(marsTuned, testData$x)
postResample(pred = MarsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.2779993 0.9338365 1.0147070

The tuned MARS model's test-set R-squared is 0.934 (RMSE 1.28), so it explains most of the variance in the test data.
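
The h() terms in the summary(marsFit) output above are hinge functions, h(u) = max(0, u). As a small sketch, the pair of X4 terms can be evaluated by hand. The coefficients and the knot are copied from that summary (the untuned fit), and x4_effect is a hypothetical name used only here.

```r
# Hinge function: zero when its argument is negative, linear otherwise.
h <- function(u) pmax(0, u)

# Contribution of the two X4 terms from the summary(marsFit) output above.
x4_effect <- function(x4) {
  -10.027487 * h(0.734892 - x4) + 9.092045 * h(x4 - 0.734892)
}

# Evaluate below the knot, at the knot, and above the knot.
x4_effect(c(0, 0.734892, 1))
```

Both terms vanish at the knot, so the fitted X4 effect is piecewise linear with a change of slope at 0.734892.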

SVM

library(kernlab)
set.seed(1122)
svmRTuned <- train(trainingData$x, trainingData$y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))


svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE     
##      0.25  2.464385  0.8140734  1.999374
##      0.50  2.215541  0.8277944  1.792697
##      1.00  2.030869  0.8484840  1.638046
##      2.00  1.887732  0.8655903  1.492305
##      4.00  1.762290  0.8793559  1.396959
##      8.00  1.706200  0.8847200  1.362491
##     16.00  1.706837  0.8857520  1.376093
##     32.00  1.706601  0.8858170  1.376046
##     64.00  1.706601  0.8858170  1.376046
##    128.00  1.706601  0.8858170  1.376046
##    256.00  1.706601  0.8858170  1.376046
##    512.00  1.706601  0.8858170  1.376046
##   1024.00  1.706601  0.8858170  1.376046
##   2048.00  1.706601  0.8858170  1.376046
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06096343
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06096343 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 8 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0609634348804832 
## 
## Number of Support Vectors : 154 
## 
## Objective Function Value : -75.871 
## Training error : 0.00936
head(predict(svmRTuned, testData$x))
## [1] 19.210924 21.859312 13.012286  7.552461 12.553335 14.018247
svmPred <- predict(svmRTuned, testData$x)
postResample(pred = svmPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.0462947 0.8302912 1.5521846

The SVM's test-set RMSE (2.05) is higher than that of MARS (1.28), and its R-squared (0.83) is lower than that of MARS (0.93). It still explains most of the variance, and more than KNN (0.68).
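
The cost values in the tuning table are powers of two, and sigma controls the width of the radial basis kernel. The sketch below rebuilds that cost sequence and evaluates a kernel of the form exp(-sigma * ||a - b||^2), which is assumed here to be the parameterization used by kernlab. The names cost_grid and rbf_kernel are hypothetical.

```r
# Cost values matching the C column in the tuning output: 2^-2 up to 2^11.
cost_grid <- 2^seq(-2, 11)

# Radial basis function kernel in the form exp(-sigma * ||a - b||^2).
rbf_kernel <- function(a, b, sigma) exp(-sigma * sum((a - b)^2))

sigma_hat <- 0.06096343  # value reported in the tuning output above
rbf_kernel(c(0, 0), c(1, 1), sigma_hat)
```

The kernel equals 1 for identical points and decays toward 0 as the points move apart, more quickly for larger sigma.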

Neural Networks

Checking the x values with trainingData$x, they appear to be on a similar scale (all from 0 to 0.99), so I will use the nnet package.

I went with the parameters from the book, including 5 hidden units.

library(nnet)
set.seed(100)
nnetFit <- nnet(trainingData$x, trainingData$y, size = 5,
                decay = .01,
                linout = TRUE,
                trace = FALSE, 
                maxit = 500, 
                MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnPred <- predict(nnetFit, testData$x)
postResample(pred = nnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.7850964 0.7033146 2.1812067

The test-set RMSE is 2.79 and R-squared is 0.70. Let's try the caret package, which offers more help with tuning.

nnetGrid <- expand.grid(.size = c(1, 3, 5, 7, 10),
                        .decay = c(0, 0.01, 0.1, 1))
set.seed(1122)
nnetTune <- train(trainingData$x, trainingData$y,
                  method = "nnet",
                  preProc = c("center", "scale"),
                  tuneGrid = nnetGrid,
                  trControl = trainControl(method = "cv"),
                  linout = TRUE,
                  trace = FALSE,
                  maxit = 500,
                  MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1)

# which size is best
nnetTune$bestTune
##    size decay
## 12    5     1

The best tune has a size of 5 (5 hidden units) and a decay of 1 instead of 0.01. Test the fit:

nnPred <- predict(nnetTune, testData$x)
postResample(pred = nnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.7217986 0.7036043 2.1329360

This model offers a similar test-set RMSE (2.72) and R-squared (0.70) to the nnet model with decay 0.01.

Running it again with the nnet package and corrected decay:

set.seed(100)
nnetFit <- nnet(trainingData$x, trainingData$y, size = 5,
                decay = 1,
                linout = TRUE,
                trace = FALSE, 
                maxit = 500, 
                MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnPred <- predict(nnetFit, testData$x)
postResample(pred = nnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.2106642 0.8027398 1.6834260

With decay = 1, the model fit directly with nnet improves to a test-set RMSE of 2.21 and an R-squared of 0.80.
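
The MaxNWts expressions in the nnet calls above count the weights of a single-hidden-layer regression network. A small sketch of that count, with n_weights as a hypothetical helper name:

```r
# Number of weights in a single-hidden-layer regression network with
# P inputs, H hidden units, and one output (bias terms included).
n_weights <- function(P, H) H * (P + 1) + H + 1

n_weights(P = 10, H = 5)   # 61, the MaxNWts used for size = 5
n_weights(P = 10, H = 10)  # 121, the MaxNWts used for the caret grid
```

Setting MaxNWts from the largest size in the grid leaves room for every candidate model.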

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

MARS gives the best performance. On the test set it has the lowest RMSE (1.28) and the highest R-squared (0.93), ahead of the SVM, the neural network, and KNN.
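
For a side-by-side view, the test-set results can be collected in a small data frame. The numbers are copied from the postResample outputs above. The neural network row uses the nnet fit with decay = 1, and test_results is a hypothetical name.

```r
# Test-set results copied from the postResample() outputs above.
test_results <- data.frame(
  model    = c("KNN", "MARS", "SVM (radial)", "nnet (size 5, decay 1)"),
  RMSE     = c(3.2040595, 1.2779993, 2.0462947, 2.2106642),
  Rsquared = c(0.6819919, 0.9338365, 0.8302912, 0.8027398)
)

# Sort so the model with the smallest test RMSE comes first.
test_results[order(test_results$RMSE), ]
```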

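MARS selects X1–X5 (though not in numerical order), and none of the non-informative predictors appear in the importance list. X3 is listed last, with a scaled importance of 0: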

varImp(marsTuned)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

#check the NA values
colSums(is.na(ChemicalManufacturingProcess))
##                  Yield   BiologicalMaterial01   BiologicalMaterial02 
##                      0                      0                      0 
##   BiologicalMaterial03   BiologicalMaterial04   BiologicalMaterial05 
##                      0                      0                      0 
##   BiologicalMaterial06   BiologicalMaterial07   BiologicalMaterial08 
##                      0                      0                      0 
##   BiologicalMaterial09   BiologicalMaterial10   BiologicalMaterial11 
##                      0                      0                      0 
##   BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02 
##                      0                      1                      3 
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05 
##                     15                      1                      1 
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08 
##                      2                      1                      1 
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11 
##                      0                      9                     10 
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14 
##                      1                      0                      1 
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17 
##                      0                      0                      0 
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20 
##                      0                      0                      0 
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23 
##                      0                      1                      1 
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26 
##                      1                      5                      5 
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29 
##                      5                      5                      5 
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32 
##                      5                      5                      0 
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35 
##                      5                      5                      5 
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38 
##                      5                      0                      0 
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41 
##                      0                      1                      1 
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44 
##                      0                      0                      0 
## ManufacturingProcess45 
##                      0
#VIM package for simple imputation 
library(VIM)
set.seed(1122)

chem_imputed <- kNN(ChemicalManufacturingProcess, k = 5)

#check that there are no missing values
colSums(is.na(chem_imputed))
##                      Yield       BiologicalMaterial01 
##                          0                          0 
##       BiologicalMaterial02       BiologicalMaterial03 
##                          0                          0 
##       BiologicalMaterial04       BiologicalMaterial05 
##                          0                          0 
##       BiologicalMaterial06       BiologicalMaterial07 
##                          0                          0 
##       BiologicalMaterial08       BiologicalMaterial09 
##                          0                          0 
##       BiologicalMaterial10       BiologicalMaterial11 
##                          0                          0 
##       BiologicalMaterial12     ManufacturingProcess01 
##                          0                          0 
##     ManufacturingProcess02     ManufacturingProcess03 
##                          0                          0 
##     ManufacturingProcess04     ManufacturingProcess05 
##                          0                          0 
##     ManufacturingProcess06     ManufacturingProcess07 
##                          0                          0 
##     ManufacturingProcess08     ManufacturingProcess09 
##                          0                          0 
##     ManufacturingProcess10     ManufacturingProcess11 
##                          0                          0 
##     ManufacturingProcess12     ManufacturingProcess13 
##                          0                          0 
##     ManufacturingProcess14     ManufacturingProcess15 
##                          0                          0 
##     ManufacturingProcess16     ManufacturingProcess17 
##                          0                          0 
##     ManufacturingProcess18     ManufacturingProcess19 
##                          0                          0 
##     ManufacturingProcess20     ManufacturingProcess21 
##                          0                          0 
##     ManufacturingProcess22     ManufacturingProcess23 
##                          0                          0 
##     ManufacturingProcess24     ManufacturingProcess25 
##                          0                          0 
##     ManufacturingProcess26     ManufacturingProcess27 
##                          0                          0 
##     ManufacturingProcess28     ManufacturingProcess29 
##                          0                          0 
##     ManufacturingProcess30     ManufacturingProcess31 
##                          0                          0 
##     ManufacturingProcess32     ManufacturingProcess33 
##                          0                          0 
##     ManufacturingProcess34     ManufacturingProcess35 
##                          0                          0 
##     ManufacturingProcess36     ManufacturingProcess37 
##                          0                          0 
##     ManufacturingProcess38     ManufacturingProcess39 
##                          0                          0 
##     ManufacturingProcess40     ManufacturingProcess41 
##                          0                          0 
##     ManufacturingProcess42     ManufacturingProcess43 
##                          0                          0 
##     ManufacturingProcess44     ManufacturingProcess45 
##                          0                          0 
##                  Yield_imp   BiologicalMaterial01_imp 
##                          0                          0 
##   BiologicalMaterial02_imp   BiologicalMaterial03_imp 
##                          0                          0 
##   BiologicalMaterial04_imp   BiologicalMaterial05_imp 
##                          0                          0 
##   BiologicalMaterial06_imp   BiologicalMaterial07_imp 
##                          0                          0 
##   BiologicalMaterial08_imp   BiologicalMaterial09_imp 
##                          0                          0 
##   BiologicalMaterial10_imp   BiologicalMaterial11_imp 
##                          0                          0 
##   BiologicalMaterial12_imp ManufacturingProcess01_imp 
##                          0                          0 
## ManufacturingProcess02_imp ManufacturingProcess03_imp 
##                          0                          0 
## ManufacturingProcess04_imp ManufacturingProcess05_imp 
##                          0                          0 
## ManufacturingProcess06_imp ManufacturingProcess07_imp 
##                          0                          0 
## ManufacturingProcess08_imp ManufacturingProcess09_imp 
##                          0                          0 
## ManufacturingProcess10_imp ManufacturingProcess11_imp 
##                          0                          0 
## ManufacturingProcess12_imp ManufacturingProcess13_imp 
##                          0                          0 
## ManufacturingProcess14_imp ManufacturingProcess15_imp 
##                          0                          0 
## ManufacturingProcess16_imp ManufacturingProcess17_imp 
##                          0                          0 
## ManufacturingProcess18_imp ManufacturingProcess19_imp 
##                          0                          0 
## ManufacturingProcess20_imp ManufacturingProcess21_imp 
##                          0                          0 
## ManufacturingProcess22_imp ManufacturingProcess23_imp 
##                          0                          0 
## ManufacturingProcess24_imp ManufacturingProcess25_imp 
##                          0                          0 
## ManufacturingProcess26_imp ManufacturingProcess27_imp 
##                          0                          0 
## ManufacturingProcess28_imp ManufacturingProcess29_imp 
##                          0                          0 
## ManufacturingProcess30_imp ManufacturingProcess31_imp 
##                          0                          0 
## ManufacturingProcess32_imp ManufacturingProcess33_imp 
##                          0                          0 
## ManufacturingProcess34_imp ManufacturingProcess35_imp 
##                          0                          0 
## ManufacturingProcess36_imp ManufacturingProcess37_imp 
##                          0                          0 
## ManufacturingProcess38_imp ManufacturingProcess39_imp 
##                          0                          0 
## ManufacturingProcess40_imp ManufacturingProcess41_imp 
##                          0                          0 
## ManufacturingProcess42_imp ManufacturingProcess43_imp 
##                          0                          0 
## ManufacturingProcess44_imp ManufacturingProcess45_imp 
##                          0                          0
chem_final <- chem_imputed |>
  dplyr::select(-ends_with("_imp"))

#leaving out the part where I checked for NZV because I didn't end up removing any
#variables

set.seed(1122)
train_index <- sample(nrow(chem_final), 0.8 * nrow(chem_final))

#train and test sets (yield + predictors are all in the same df)
train_y <- chem_final$Yield[train_index]
train_x <- chem_final[train_index, ] |> dplyr::select(-Yield)


test_y <- chem_final$Yield[-train_index]
test_x  <- chem_final[-train_index, ] |> dplyr::select(-Yield)
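
The expression 0.8 * nrow(chem_final) is generally not a whole number, so it can help to make the training-set size explicit. The sketch below assumes the data set has 176 rows, which is consistent with the 140 training samples reported in the model output. The names n_rows, n_train, and test_index are hypothetical, and the sampled rows are not claimed to match the split above.

```r
# Assumed number of rows in the imputed data; replace with nrow(chem_final).
n_rows <- 176

set.seed(1122)
n_train <- floor(0.8 * n_rows)          # 140 training rows
train_index <- sample(n_rows, n_train)  # row numbers for the training set
test_index <- setdiff(seq_len(n_rows), train_index)

c(train = length(train_index), test = length(test_index))
```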

MARS

# Define the candidate models to test

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

# Fix the seed so that the results can be reproduced

set.seed(100)

# Explicitly declare the candidate models to test
marsTuned <- train(train_x, train_y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProcess = c("center", "scale"),
                   trControl = trainControl(method = "cv"))

marsTuned
## Multivariate Adaptive Regression Spline 
## 
## 140 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 125, 127, 125, 127, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.445585  0.4286929  1.1697364
##   1        3      1.271529  0.5560206  1.0487655
##   1        4      1.221652  0.6127651  0.9909329
##   1        5      1.263234  0.5785473  1.0424573
##   1        6      1.276463  0.5764660  1.0476750
##   1        7      1.259923  0.5988646  1.0247931
##   1        8      1.256659  0.6008041  1.0349147
##   1        9      1.323914  0.5714970  1.0810989
##   1       10      1.329992  0.5720971  1.0872536
##   1       11      1.352325  0.5641226  1.1114240
##   1       12      1.376280  0.5457851  1.1369027
##   1       13      1.365723  0.5536315  1.1278057
##   1       14      1.363425  0.5531568  1.1218102
##   1       15      1.363425  0.5531568  1.1218102
##   1       16      1.363425  0.5531568  1.1218102
##   1       17      1.370906  0.5494370  1.1288829
##   1       18      1.375610  0.5477760  1.1327603
##   1       19      1.375610  0.5477760  1.1327603
##   1       20      1.375610  0.5477760  1.1327603
##   1       21      1.375610  0.5477760  1.1327603
##   1       22      1.375610  0.5477760  1.1327603
##   1       23      1.375610  0.5477760  1.1327603
##   1       24      1.375610  0.5477760  1.1327603
##   1       25      1.375610  0.5477760  1.1327603
##   1       26      1.375610  0.5477760  1.1327603
##   1       27      1.375610  0.5477760  1.1327603
##   1       28      1.375610  0.5477760  1.1327603
##   1       29      1.375610  0.5477760  1.1327603
##   1       30      1.375610  0.5477760  1.1327603
##   1       31      1.375610  0.5477760  1.1327603
##   1       32      1.375610  0.5477760  1.1327603
##   1       33      1.375610  0.5477760  1.1327603
##   1       34      1.375610  0.5477760  1.1327603
##   1       35      1.375610  0.5477760  1.1327603
##   1       36      1.375610  0.5477760  1.1327603
##   1       37      1.375610  0.5477760  1.1327603
##   1       38      1.375610  0.5477760  1.1327603
##   2        2      1.445585  0.4286929  1.1697364
##   2        3      1.364837  0.4910211  1.1169739
##   2        4      1.295405  0.5445814  1.0655822
##   2        5      1.325028  0.5492070  1.0762444
##   2        6      1.291145  0.5884713  1.0574915
##   2        7      1.315753  0.5960677  1.1061360
##   2        8      1.312895  0.5974567  1.0968864
##   2        9      1.284863  0.6086840  1.0661867
##   2       10      1.270438  0.6409868  1.0350719
##   2       11      1.232756  0.6412548  1.0116010
##   2       12      1.224191  0.6466803  1.0070967
##   2       13      1.242985  0.6465221  1.0258268
##   2       14      1.276888  0.6411067  1.0560455
##   2       15      1.360652  0.5848061  1.1078526
##   2       16      1.379017  0.5817154  1.1042628
##   2       17      1.357163  0.5946035  1.0915653
##   2       18      1.372913  0.5930793  1.1036115
##   2       19      1.430412  0.5617581  1.1629271
##   2       20      1.422250  0.5733586  1.1580907
##   2       21      1.424015  0.5815266  1.1529841
##   2       22      1.427942  0.5834527  1.1579311
##   2       23      1.412013  0.5870073  1.1441247
##   2       24      1.408759  0.5881885  1.1343740
##   2       25      1.408759  0.5881885  1.1343740
##   2       26      1.408759  0.5881885  1.1343740
##   2       27      1.408759  0.5881885  1.1343740
##   2       28      1.428444  0.5877352  1.1434987
##   2       29      1.428444  0.5877352  1.1434987
##   2       30      1.428444  0.5877352  1.1434987
##   2       31      1.428444  0.5877352  1.1434987
##   2       32      1.428444  0.5877352  1.1434987
##   2       33      1.428444  0.5877352  1.1434987
##   2       34      1.428444  0.5877352  1.1434987
##   2       35      1.428444  0.5877352  1.1434987
##   2       36      1.428444  0.5877352  1.1434987
##   2       37      1.428444  0.5877352  1.1434987
##   2       38      1.428444  0.5877352  1.1434987
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
head(predict(marsTuned, test_x))
##             y
## [1,] 43.15080
## [2,] 40.94592
## [3,] 42.30278
## [4,] 39.75909
## [5,] 39.47725
## [6,] 41.49819
#checking the most important variables
varImp(marsTuned)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32   100.0
## ManufacturingProcess09    47.8
## ManufacturingProcess13     0.0

According to this output, only two variables have high importance.

MarsPred <- predict(marsTuned, test_x)
postResample(pred = MarsPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.9180288 0.6608032 0.7386973

The test-set RMSE is 0.918 and the R-squared is about 0.66, so the model explains about 66% of the variability in yield. This is better than the linear models I tried.

KNN

Using functions from the caret package:

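# One predictor in this data set has near-zero variance (NZV), so remove it.
# The length check avoids dropping every column if nothing is flagged.
nzv <- nearZeroVar(train_x)
knnDescr <- if (length(nzv) > 0) train_x[, -nzv] else train_x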

set.seed(100)

knnTune <- train(knnDescr,
                 train_y,
                 method = "knn",
                 preProc = c("center", "scale"),
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))

knnTune
## k-Nearest Neighbors 
## 
## 140 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 125, 127, 125, 127, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    1  1.572664  0.4355856  1.177938
##    2  1.375642  0.5000665  1.076028
##    3  1.320918  0.5551471  1.089611
##    4  1.302554  0.5505480  1.034190
##    5  1.341124  0.5312473  1.071822
##    6  1.336037  0.5427579  1.064899
##    7  1.347944  0.5279841  1.087428
##    8  1.344336  0.5279758  1.105406
##    9  1.353617  0.5331393  1.112105
##   10  1.354839  0.5324870  1.117459
##   11  1.382060  0.5017001  1.131944
##   12  1.370833  0.5053568  1.116786
##   13  1.390234  0.4893131  1.128413
##   14  1.385492  0.5027572  1.127644
##   15  1.399392  0.4899669  1.139616
##   16  1.408471  0.4807428  1.149303
##   17  1.429706  0.4632304  1.173909
##   18  1.437080  0.4598546  1.170098
##   19  1.450011  0.4490827  1.191298
##   20  1.443794  0.4618061  1.189157
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.
head(predict(knnTune, test_x))
## [1] 42.025 41.975 42.015 41.425 40.080 40.210
#most important variables are listed in part B

Test the fit:

knnPred <- predict(knnTune, test_x)
postResample(pred = knnPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 1.1632277 0.4874411 0.9598611

The RMSE is fairly low at about 1.16, but the R-squared is only about 0.49, so the model explains only about half of the variance.
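To make the KNN prediction step concrete, here is a minimal base R sketch of k-nearest-neighbors regression with centering and scaling. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; this is not how caret implements the model internally.

```r
# Minimal k-nearest-neighbors regression in base R (illustrative only)
knn_predict <- function(train_x, train_y, new_x, k) {
  # Center and scale with the training means and standard deviations
  mu    <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = sigma)
  nw <- scale(new_x,  center = mu, scale = sigma)

  apply(nw, 1, function(row) {
    d <- sqrt(colSums((t(tr) - row)^2))  # Euclidean distance to each training row
    mean(train_y[order(d)[seq_len(k)]])  # average response of the k closest rows
  })
}

# Toy data: the response depends on x1 only
set.seed(1)
toy_x <- matrix(runif(60), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
toy_y <- 3 * toy_x[, "x1"] + rnorm(30, sd = 0.1)
new_x <- matrix(c(0.2, 0.5, 0.8, 0.5), ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("x1", "x2")))

toy_pred <- knn_predict(toy_x, toy_y, new_x, k = 4)
toy_pred
```

Scaling matters here because the distance is computed across all predictors at once, so a predictor measured in the thousands would otherwise swamp one measured in single digits.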

SVM

set.seed(1122)
svmRTuned <- train(train_x, train_y,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   tuneLength = 14,
                   trControl = trainControl(method = "cv"))


svmRTuned
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 140 samples
##  57 predictor
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 125, 126, 126, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   C        RMSE      Rsquared   MAE      
##      0.25  1.452498  0.5109138  1.1701907
##      0.50  1.321489  0.5672822  1.0590686
##      1.00  1.238206  0.6088460  0.9771733
##      2.00  1.192591  0.6268670  0.9248975
##      4.00  1.177580  0.6330657  0.9228731
##      8.00  1.175891  0.6316652  0.9237709
##     16.00  1.175405  0.6318894  0.9231476
##     32.00  1.175405  0.6318894  0.9231476
##     64.00  1.175405  0.6318894  0.9231476
##    128.00  1.175405  0.6318894  0.9231476
##    256.00  1.175405  0.6318894  0.9231476
##    512.00  1.175405  0.6318894  0.9231476
##   1024.00  1.175405  0.6318894  0.9231476
##   2048.00  1.175405  0.6318894  0.9231476
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01530119
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01530119 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: eps-svr  (regression) 
##  parameter : epsilon = 0.1  cost C = 16 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0153011896520962 
## 
## Number of Support Vectors : 122 
## 
## Objective Function Value : -70.0997 
## Training error : 0.009114
head(predict(svmRTuned, test_x))
## [1] 42.41294 42.66431 42.09665 40.51665 40.33008 40.34078
#most important variables are listed in part B
# varImp() has no model-specific method for this model

Check the fit:

svmPred <- predict(svmRTuned, test_x)
postResample(pred = svmPred, obs = test_y)
##      RMSE  Rsquared       MAE 
## 0.9669417 0.6307160 0.7920620

This is similar to MARS, with an R-squared of about 0.63 and an RMSE of 0.967.
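For context on the sigma value in the SVM output, here is a small sketch of the Gaussian radial basis kernel. It assumes the parameterization documented for kernlab's RBF kernel, exp(-sigma * ||x - z||^2); the helper `rbf_kernel` and the two points are hypothetical.

```r
# Gaussian RBF kernel, assuming the form exp(-sigma * ||x - z||^2)
rbf_kernel <- function(x, z, sigma) {
  exp(-sigma * sum((x - z)^2))
}

sigma <- 0.01530119   # value reported in the tuning output above

# Two hypothetical points in a 57-dimensional centered-and-scaled space
a <- rep(0, 57)
b <- rep(1, 57)       # one standard deviation away on every predictor

rbf_kernel(a, a, sigma)  # identical points have similarity 1
rbf_kernel(a, b, sigma)  # equals exp(-sigma * 57)
```

A small sigma like this one means similarity decays slowly with distance, so each support vector influences a fairly wide region of predictor space.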

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

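MARS had the best test-set performance, with an RMSE of 0.918 and an R-squared of about 0.66. SVM was close behind (RMSE 0.967, R-squared about 0.63), and KNN was weakest (RMSE about 1.16, R-squared about 0.49). However, explaining only about 66% of the variance is still on the low end, so I wouldn’t recommend it.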
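The test-set comparison can be collected in one place. This is a small sketch that copies the postResample values printed earlier into a data frame (the object names `results` and `ranked` are my own) and sorts by RMSE.

```r
# Test-set metrics copied from the postResample() outputs in this section
results <- data.frame(
  model    = c("MARS", "KNN", "SVM"),
  RMSE     = c(0.9180288, 1.1632277, 0.9669417),
  Rsquared = c(0.6608032, 0.4874411, 0.6307160),
  MAE      = c(0.7386973, 0.9598611, 0.7920620)
)

# Sort from lowest to highest test-set RMSE
ranked <- results[order(results$RMSE), ]
ranked
```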

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

Reproducing the top variables for each model here.

MARS is the optimal model, and there are only two important variables, manufacturing process 32 and 09. In this case, manufacturing processes dominate the list.

#MARS (only shows three)
varImp(marsTuned)
## earth variable importance
## 
##                        Overall
## ManufacturingProcess32   100.0
## ManufacturingProcess09    47.8
## ManufacturingProcess13     0.0

Looking at the other models:

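I used varImp with KNN and SVM, but realized something was wrong when the lists were exactly the same. caret has no model-specific importance method for these two models, so varImp falls back to a model-independent filter score computed from each predictor’s individual relationship with the outcome, which is why both lists matched. Instead, the code below uses the vip package to compute permutation-based importance, measured by the change in RMSE.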
https://rdrr.io/cran/vip/src/R/vi_permute.R https://rpubs.com/erblast/mars
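The idea behind permutation importance can be sketched by hand in base R: shuffle one predictor, re-predict, and see how much the RMSE gets worse. This is a toy illustration on simulated data with a linear model (all names here are hypothetical), not the vip implementation.

```r
# Permutation importance by hand on simulated data (illustrative only)
set.seed(42)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 + 1 * sim$x2 + rnorm(n, sd = 0.5)  # x3 is pure noise

fit  <- lm(y ~ x1 + x2 + x3, data = sim)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseline <- rmse(sim$y, predict(fit, sim))

perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])           # break the link between v and y
  rmse(sim$y, predict(fit, shuffled)) - baseline   # increase in RMSE
})

sort(perm_importance, decreasing = TRUE)
```

Because the shuffle is random, the scores (and sometimes the ranking of close predictors) change from run to run unless a seed is set.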

library(vip)
## Warning: package 'vip' was built under R version 4.5.3
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
#SVM
vi_svm <- vi(svmRTuned,
             method = "permute",
             target = train_y,
             metric = "RMSE",
             pred_wrapper = predict)

vi_svm
## # A tibble: 57 × 2
##    Variable               Importance
##    <chr>                       <dbl>
##  1 ManufacturingProcess32      0.526
##  2 ManufacturingProcess36      0.280
##  3 ManufacturingProcess13      0.246
##  4 ManufacturingProcess37      0.222
##  5 ManufacturingProcess24      0.200
##  6 ManufacturingProcess17      0.199
##  7 ManufacturingProcess28      0.195
##  8 ManufacturingProcess11      0.191
##  9 BiologicalMaterial03        0.188
## 10 ManufacturingProcess23      0.173
## # ℹ 47 more rows
#KNN
vi_knn <- vi(knnTune,
             method = "permute",
             target = train_y,
             metric = "RMSE",
             pred_wrapper = predict)


vi_knn
## # A tibble: 56 × 2
##    Variable               Importance
##    <chr>                       <dbl>
##  1 ManufacturingProcess12     0.0910
##  2 ManufacturingProcess36     0.0821
##  3 ManufacturingProcess32     0.0656
##  4 ManufacturingProcess13     0.0626
##  5 BiologicalMaterial03       0.0583
##  6 BiologicalMaterial01       0.0504
##  7 BiologicalMaterial06       0.0465
##  8 ManufacturingProcess25     0.0409
##  9 BiologicalMaterial09       0.0408
## 10 ManufacturingProcess23     0.0399
## # ℹ 46 more rows

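Both MARS and SVM rank manufacturing process 32 as the top variable. MARS only shows three: manufacturing process 32 at #1 (100), manufacturing process 09 at 47.8, and the only other listed variable, manufacturing process 13, at 0. In the output above, KNN puts manufacturing process 12 at the top, followed by 36 and 32. KNN’s permutation importances are much lower than SVM’s; the MARS scores are scaled from 0 to 100, so they are not directly comparable to either.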

Manufacturing variables dominate for all three models. Note that there are more manufacturing than biological variables in general (45 versus 12).
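One way to check the overlap and the process-versus-biological split is with set operations. This sketch copies the top-ten names from the vi_svm and vi_knn outputs printed above; since permutation results vary between runs, the exact lists are specific to that output. The helper `count_types` is my own.

```r
# Top-ten names copied from the vi_svm and vi_knn outputs printed above
svm_top <- c("ManufacturingProcess32", "ManufacturingProcess36",
             "ManufacturingProcess13", "ManufacturingProcess37",
             "ManufacturingProcess24", "ManufacturingProcess17",
             "ManufacturingProcess28", "ManufacturingProcess11",
             "BiologicalMaterial03",   "ManufacturingProcess23")
knn_top <- c("ManufacturingProcess12", "ManufacturingProcess36",
             "ManufacturingProcess32", "ManufacturingProcess13",
             "BiologicalMaterial03",   "BiologicalMaterial01",
             "BiologicalMaterial06",   "ManufacturingProcess25",
             "BiologicalMaterial09",   "ManufacturingProcess23")

shared   <- intersect(svm_top, knn_top)  # in both top-ten lists
svm_only <- setdiff(svm_top, knn_top)    # only in the SVM list
knn_only <- setdiff(knn_top, svm_top)    # only in the KNN list

# Count process versus biological predictors in each list
count_types <- function(v) {
  c(process    = sum(grepl("^ManufacturingProcess", v)),
    biological = sum(grepl("^BiologicalMaterial", v)))
}
rbind(SVM = count_types(svm_top), KNN = count_types(knn_top))
```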

How they compare to the linear model: Last week, I checked the top variables for elastic net. Bio 06 is at the top (among the nonlinear results above, it appears only in KNN’s top ten, at #7). The top variable for MARS and SVM (process 32) ranks 7th, and process 36 (second for both SVM and KNN above) ranks 9th. Overall, they’re different lists, still dominated by manufacturing variables.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Sources for some plotting techniques/the plotmo package:

https://bradleyboehmke.github.io/HOML/mars.html#the-basic-idea https://rpubs.com/erblast/mars

Visualizing all three models and their predictor variables first:

#MARS

top_vars <- evimp(marsTuned$finalModel)
# head() avoids NA entries when fewer than ten variables are returned
top_10_names <- head(rownames(top_vars), 10)

plotmo(marsTuned, which = 1, predict.terms = top_10_names)
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##                   100                17.49                12.83
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##                  2.63               146.02                20.06
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##                    11.4                     21                   1.55
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##                     934                 998.75                  206.6
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##                     177                    178                 45.805
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##                     9.1                    9.4                      0
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##                    34.5                 4856.5                 6033.5
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##                    4588                   34.4                   4842
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##                    6024                   4582                   -0.3
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##                       5                      3                      8
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##                    4856                   6047                 4585.5
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##                    10.4                   19.9                   9.15
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##                    70.8                    158                     64
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##                     2.5                  495.5                   0.02
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##                       1                      3                    7.2
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##                       0                      0                   11.6
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##                     0.8                    1.9                    2.2

plotmo(marsTuned, 
       which = 1, 
       subset = top_10_names, 
       caption = "MARS",
       pt.col = "gray")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##                   100                17.49                12.83
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##                  2.63               146.02                20.06
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##                    11.4                     21                   1.55
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##                     934                 998.75                  206.6
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##                     177                    178                 45.805
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##                     9.1                    9.4                      0
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##                    34.5                 4856.5                 6033.5
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##                    4588                   34.4                   4842
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##                    6024                   4582                   -0.3
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##                       5                      3                      8
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##                    4856                   6047                 4585.5
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##                    10.4                   19.9                   9.15
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##                    70.8                    158                     64
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##                     2.5                  495.5                   0.02
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##                       1                      3                    7.2
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##                       0                      0                   11.6
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##                     0.8                    1.9                    2.2

#SVM
plotmo(svmRTuned, 
       which = 1, 
       degree1 = c("ManufacturingProcess32", "ManufacturingProcess36",
                   "ManufacturingProcess37", "ManufacturingProcess17",
                   "ManufacturingProcess11", "ManufacturingProcess13",
                   "ManufacturingProcess28", "ManufacturingProcess34",  
                   "BiologicalMaterial05", "ManufacturingProcess24"),
       caption = "SVM",
       pt.col = "gray")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##                   100                17.49                12.83
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##                  2.63               146.02                20.06
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##                    11.4                     21                   1.55
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##                     934                 998.75                  206.6
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##                     177                    178                 45.805
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##                     9.1                    9.4                      0
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##                    34.5                 4856.5                 6033.5
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##                    4588                   34.4                   4842
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##                    6024                   4582                   -0.3
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##                       5                      3                      8
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##                    4856                   6047                 4585.5
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##                    10.4                   19.9                   9.15
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##                    70.8                    158                     64
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##                     2.5                  495.5                   0.02
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##                       1                      3                    7.2
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##                       0                      0                   11.6
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##                     0.8                    1.9                    2.2

#KNN (looks like it may have overfit)
plotmo(knnTune, 
       which = 1, 
       degree1 = c("ManufacturingProcess36",
                   "ManufacturingProcess13",
                   "ManufacturingProcess17",
                   "ManufacturingProcess12",
                   "BiologicalMaterial02",
                   "ManufacturingProcess22",
                   "BiologicalMaterial01",
                   "ManufacturingProcess07",
                   "ManufacturingProcess09",
                   "ManufacturingProcess32"),
       caption = "KNN",
       pt.col = "gray")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
##                 17.49                12.83                 2.63
##  BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
##                146.02                20.06                   11.4
##  ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
##                      21                   1.55                    934
##  ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
##                  998.75                  206.6                    177
##  ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
##                     178                 45.805                    9.1
##  ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
##                     9.4                      0                   34.5
##  ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
##                  4856.5                 6033.5                   4588
##  ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
##                    34.4                   4842                   6024
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##                    4582                   -0.3                      5
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
##                       3                      8                   4856
##  ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
##                    6047                 4585.5                   10.4
##  ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
##                    19.9                   9.15                   70.8
##  ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
##                     158                     64                    2.5
##  ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
##                   495.5                   0.02                      1
##  ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
##                       3                    7.2                      0
##  ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
##                       0                   11.6                    0.8
##  ManufacturingProcess44 ManufacturingProcess45
##                     1.9                    2.2

Since MARS only has 3 predictors (one of which the model has assigned zero importance), I’m pulling the unique predictors for each model, not just the model with the best R-squared.

MARS

col_index <- which(colnames(train_x) == "ManufacturingProcess09")

plotmo(marsTuned$finalModel,
       which = 1,
       caption = "MARS",
       pt.col = "gray",
       degree1 = "ManufacturingProcess09")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                           -0.08877564           -0.1522094          -0.09892082
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##            -0.1401945          -0.07891408           -0.1332374
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##            -0.1474496           0.02130545          -0.02657837
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##            -0.2418646           -0.1860545              -0.1778
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##               0.1434582              0.5093331              0.4523672
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##                0.383586            -0.08741698             -0.2843345
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##              -0.9822878              0.8755778             0.08980004
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##              -0.1119662             0.03204418             -0.4982111
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##             -0.01004455             0.04847142             -0.0871858
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##             -0.03366167             0.02889183             0.09163441
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##              -0.1084604             0.08252239             -0.1969995
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##               -0.112253            -0.01294419             -0.2008734
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##             -0.02795428            -0.07705969              -0.069302
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##               0.7601398              -0.351222            -0.06190869
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##               0.1423564            -0.08135609              0.1887734
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##              0.07639713            -0.05353648              0.4666874
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##            -0.003159363              0.7421258              0.2112721
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##              -0.4532314             -0.4343483              0.1877572
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##              -0.1172866              0.2908739              0.1331032

Manufacturing process 09 technically isn’t unique, since it appears in KNN too. Here, the predictor correlates with an increase in yield, but levels off after a certain point, as shown by the model.
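The rise-then-level-off shape is what a single MARS hinge term produces. Here is a minimal sketch with made-up numbers (the knot, intercept, and slope are hypothetical, not the fitted values from marsTuned).

```r
# A single MARS-style hinge term: zero on one side of the knot, linear on the other
hinge <- function(x, knot) pmax(0, x - knot)

x    <- seq(-2, 2, by = 0.5)  # hypothetical standardized predictor values
knot <- 0                     # hypothetical knot location

# Hypothetical fitted curve that rises until the knot and is flat afterwards;
# it uses the mirrored hinge max(0, knot - x)
y_hat <- 40 - 1.5 * pmax(0, knot - x)

cbind(x, hinge = hinge(x, knot), y_hat)
```

Below the knot the prediction climbs with x; above it the hinge contributes nothing, so the curve stays flat at the intercept.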

SVM

plotmo(svmRTuned, 
       which = 1, 
       degree1 = c("ManufacturingProcess37",
                   "ManufacturingProcess11",
                   "ManufacturingProcess28", 
                   "ManufacturingProcess34",  
                   "BiologicalMaterial05", 
                   "ManufacturingProcess24"),
       caption = "SVM",
       pt.col = "gray")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
##                   100                17.49                12.83
##  BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
##                  2.63               146.02                20.06
##  ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
##                    11.4                     21                   1.55
##  ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
##                     934                 998.75                  206.6
##  ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
##                     177                    178                 45.805
##  ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
##                     9.1                    9.4                      0
##  ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
##                    34.5                 4856.5                 6033.5
##  ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
##                    4588                   34.4                   4842
##  ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
##                    6024                   4582                   -0.3
##  ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
##                       5                      3                      8
##  ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
##                    4856                   6047                 4585.5
##  ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
##                    10.4                   19.9                   9.15
##  ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
##                    70.8                    158                     64
##  ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
##                     2.5                  495.5                   0.02
##  ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
##                       1                      3                    7.2
##  ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
##                       0                      0                   11.6
##  ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
##                     0.8                    1.9                    2.2

SVM has six unique predictors, all of which have been assigned somewhat linear relationships. These relationships look fairly weak. The variance looks high and, in some cases, slightly higher values in certain sections of the plot look like they may be due to random chance (noise), but SVM has assigned importance to them. It makes sense that the relationships appear less strong, since we are looking at the model’s unique chosen predictors (i.e., predictors no other model chose).

KNN

#leaving 09 in here as well to see how a different model treats it
plotmo(knnTune, 
       which = 1, 
       degree1 = c("ManufacturingProcess12",
                   "BiologicalMaterial02",
                   "ManufacturingProcess22",
                   "BiologicalMaterial01",
                   "ManufacturingProcess07",
                   "ManufacturingProcess09"),
       caption = "KNN",
       pt.col = "gray")
##  plotmo grid:    BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
##                                 6.355                55.09                67.38
##  BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
##                 12.07                18.42                48.46
##  BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
##                 17.49                12.83                 2.63
##  BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
##                146.02                20.06                   11.4
##  ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
##                      21                   1.55                    934
##  ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
##                  998.75                  206.6                    177
##  ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
##                     178                 45.805                    9.1
##  ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
##                     9.4                      0                   34.5
##  ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
##                  4856.5                 6033.5                   4588
##  ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
##                    34.4                   4842                   6024
##  ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
##                    4582                   -0.3                      5
##  ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
##                       3                      8                   4856
##  ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
##                    6047                 4585.5                   10.4
##  ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
##                    19.9                   9.15                   70.8
##  ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
##                     158                     64                    2.5
##  ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
##                   495.5                   0.02                      1
##  ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
##                       3                    7.2                      0
##  ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
##                       0                   11.6                    0.8
##  ManufacturingProcess44 ManufacturingProcess45
##                     1.9                    2.2

Again, it looks like KNN overfit on these variables. Process 07, for example, shows two similar clusters on the left and right sides of the graph. The model has seemingly assigned an arbitrary dip in the middle of the graph. Process 12 has a similar problem. Bio 01, process 22, and process 01 all show the model overfitting to small changes in the data, dipping and rising in different parts of the graph.

This is a common problem for KNN, particularly when k = 1. However, this model selected k = 4. Overfitting can also happen with data like this, where many of the variables don’t seem to have a strong relationship with yield.
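To illustrate why small k is prone to overfitting, this toy sketch fits KNN to a response that is pure noise and reports the training-set RMSE for a few values of k. The data and the helper `knn_fit_train` are hypothetical. With k = 1 every training point is its own nearest neighbor, so the training error is zero even though there is nothing real to learn.

```r
# Training-set RMSE for KNN regression on pure noise (illustrative only)
set.seed(7)
n <- 50
x <- matrix(rnorm(n * 3), ncol = 3)   # predictors unrelated to the response
y <- rnorm(n)                         # response is pure noise

d <- as.matrix(dist(x))               # pairwise Euclidean distances

knn_fit_train <- function(k) {
  sapply(seq_len(n), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]  # includes point i itself (distance 0)
    mean(y[nearest])
  })
}

train_rmse <- sapply(c(1, 4, 20), function(k) sqrt(mean((y - knn_fit_train(k))^2)))
names(train_rmse) <- c("k=1", "k=4", "k=20")
train_rmse
```

A perfect training fit on noise is the signature of overfitting, which is why k is chosen by resampled error instead of training error.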

Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Based on the unique predictors, the relationships the models picked for SVM and KNN are not strong, and KNN has overfit/shown relationships that may not be valid.

Looking at all the manufacturing predictors (from the first set of charts), some variables, like process 32 and 36, seem to have a stronger nonlinear relationship with yield. In the MARS model, I would like more data above 37 for process 13, which could help validate the relationship or show a downward slope as x increases.