library(AppliedPredictiveModeling)
library(caret)
library(GGally)
library(mlbench)
library(tibble)
library(tidyverse)
set.seed(314159)
DATA 624: Assignment 08
Setup
Assignment
Do problems 7.2 and 7.5 in Kuhn and Johnson. There are only two but they have many parts. Please submit both a link to your Rpubs and the .rmd file.
Problem 7.2
Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[ y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]
where the \(x\) values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## Let's see what it looks like in a line plot.
## It should be sine like.
# one index per observation (there are 200 training samples)
data.frame(index = 1:200, trainingData$x, y = trainingData$y) |>
  pivot_longer(cols = -index,
               names_to = "predictor",
               values_to = "value") |>
  ggplot(mapping = aes(x = index, y = value, color = predictor)) +
  geom_line() +
  labs(title = "Friedman1 Data: Predictors") +
  facet_wrap(~ predictor, ncol = 2, scales = "free")
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
One can sort of see its sine origins. But it also looks noisy. Let’s see what our non-linear models make of it.
Tune several models on these data.
makePrediction <- function(model) {
  prediction <- predict(model, newdata = testData$x)
  ## The function 'postResample' can be used to get the test set performance values
  estimates <- postResample(pred = prediction, obs = testData$y)
  return(data.frame(
    prediction = prediction,
    RMSE = estimates[1]
  ))
}
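The helper never actually gets called below (the KNN test-set check is done by hand), but its usage would look like this once a model exists (a sketch; knnModel is trained in the next section):
# Hypothetical usage: one row per test observation, with the scalar
# test-set RMSE recycled down the RMSE column.
knnResults <- makePrediction(knnModel)
head(knnResults)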
KNN
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)
knnModel
k-Nearest Neighbors
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 3.466085 0.5121775 2.816838
7 3.349428 0.5452823 2.727410
9 3.264276 0.5785990 2.660026
11 3.214216 0.6024244 2.603767
13 3.196510 0.6176570 2.591935
15 3.184173 0.6305506 2.577482
17 3.183130 0.6425367 2.567787
19 3.198752 0.6483184 2.592683
21 3.188993 0.6611428 2.588787
23 3.200458 0.6638353 2.604529
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 17.
Our KNN champion is:
Method | RMSE | Params |
---|---|---|
knn | 3.183130 | k=17 |
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set performance values
postResample(pred = knnPred, obs = testData$y)
RMSE Rsquared MAE
3.2040595 0.6819919 2.5683461
SVR
trainSVR <- function(method, tuneLength = 5) {
  # do we want to specify our own control parameters? What is the default?
  # ctrl <- trainControl(method = "cv", number = 5)
  fit <- train(
    x = trainingData$x,
    y = trainingData$y,
    method = method,
    preProcess = c("center", "scale"),
    tuneLength = tuneLength
  )
  print(fit)
  print(fit$bestTune)
  return(fit)
}

svmLModel <- trainSVR("svmLinear", 1)
Support Vector Machines with Linear Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results:
RMSE Rsquared MAE
2.551519 0.7476728 2.01746
Tuning parameter 'C' was held constant at a value of 1
C
1 1
<- trainSVR("svmRadial", 10) svmRModel
Support Vector Machines with Radial Basis Function Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 2.525979 0.7804630 2.016014
0.50 2.293423 0.7960080 1.808878
1.00 2.156969 0.8112034 1.697751
2.00 2.081487 0.8226984 1.631759
4.00 2.050866 0.8270470 1.605581
8.00 2.046703 0.8280421 1.602148
16.00 2.046385 0.8281081 1.601593
32.00 2.046385 0.8281081 1.601593
64.00 2.046385 0.8281081 1.601593
128.00 2.046385 0.8281081 1.601593
Tuning parameter 'sigma' was held constant at a value of 0.06529705
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.06529705 and C = 16.
sigma C
7 0.06529705 16
<- trainSVR("svmPoly", 4) svmPModel
Support Vector Machines with Polynomial Kernel
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
degree scale C RMSE Rsquared MAE
1 0.001 0.25 4.740884 0.6902993 3.893787
1 0.001 0.50 4.574385 0.6934968 3.754844
1 0.001 1.00 4.270128 0.7003019 3.493989
1 0.001 2.00 3.772729 0.7164046 3.080855
1 0.010 0.25 3.585626 0.7180459 2.934032
1 0.010 0.50 2.975363 0.7250589 2.440642
1 0.010 1.00 2.662080 0.7302735 2.155593
1 0.010 2.00 2.588294 0.7323073 2.067295
1 0.100 0.25 2.576129 0.7336613 2.055606
1 0.100 0.50 2.570671 0.7355963 2.037586
1 0.100 1.00 2.584292 0.7351146 2.048364
1 0.100 2.00 2.595063 0.7351536 2.058574
1 1.000 0.25 2.596904 0.7350848 2.060741
1 1.000 0.50 2.598265 0.7355652 2.063759
1 1.000 1.00 2.598357 0.7357710 2.063551
1 1.000 2.00 2.599307 0.7359000 2.064332
2 0.001 0.25 4.574377 0.6935558 3.754853
2 0.001 0.50 4.270092 0.7003607 3.493978
2 0.001 1.00 3.772638 0.7164730 3.080781
2 0.001 2.00 3.162075 0.7217320 2.601215
2 0.010 0.25 2.965569 0.7275450 2.431434
2 0.010 0.50 2.646295 0.7339781 2.140232
2 0.010 1.00 2.549960 0.7403652 2.034452
2 0.010 2.00 2.508628 0.7465640 1.986923
2 0.100 0.25 2.270246 0.7906128 1.740774
2 0.100 0.50 2.214283 0.8036707 1.680506
2 0.100 1.00 2.254140 0.8024618 1.710590
2 0.100 2.00 2.312914 0.7982985 1.768183
2 1.000 0.25 2.448406 0.7825888 1.902525
2 1.000 0.50 2.469659 0.7812319 1.918542
2 1.000 1.00 2.486838 0.7792002 1.932304
2 1.000 2.00 2.489980 0.7788210 1.935774
3 0.001 0.25 4.418001 0.6973432 3.621611
3 0.001 0.50 4.004192 0.7115996 3.268737
3 0.001 1.00 3.423407 0.7188881 2.809099
3 0.001 2.00 2.850229 0.7278810 2.324427
3 0.010 0.25 2.727309 0.7345626 2.214836
3 0.010 0.50 2.545523 0.7442909 2.038521
3 0.010 1.00 2.475986 0.7525542 1.959426
3 0.010 2.00 2.424417 0.7631450 1.895615
3 0.100 0.25 2.155248 0.8114591 1.653112
3 0.100 0.50 2.192186 0.8085547 1.677435
3 0.100 1.00 2.243233 0.8038043 1.717120
3 0.100 2.00 2.270305 0.8016215 1.744910
3 1.000 0.25 3.186492 0.6348075 2.455965
3 1.000 0.50 3.186492 0.6348075 2.455965
3 1.000 1.00 3.186492 0.6348075 2.455965
3 1.000 2.00 3.186492 0.6348075 2.455965
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
degree scale C
41 3 0.1 0.25
So, what do we have? We’ve got three different results (linear, radial and polynomial) for SVR. Let’s see who our champion was:
Method | RMSE | Params |
---|---|---|
svmLinear | 2.551519 | C=1 |
svmRadial | 2.046385 | C=16 , sigma=0.06529705 |
svmPoly | 2.155248 | C=0.25 , degree=3 , scale=0.1 |
Of the three, the radial model has the lowest RMSE.
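For completeness, the same test-set check we applied to KNN above could be pointed at the radial champion (a sketch, not run here):
svmRPred <- predict(svmRModel, newdata = testData$x)
postResample(pred = svmRPred, obs = testData$y)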
MARS
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  preProc = c("center", "scale"),
  tuneLength = 10
)
Loading required package: earth
Loading required package: Formula
Loading required package: plotmo
Loading required package: plotrix
print(marsModel)
Multivariate Adaptive Regression Spline
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared MAE
2 4.453623 0.2198309 3.682371
3 3.712197 0.4554337 3.012165
4 2.795658 0.6874873 2.240726
6 2.331132 0.7817116 1.855013
7 1.902458 0.8541971 1.506668
9 1.753020 0.8782210 1.367227
10 1.754192 0.8785378 1.359698
12 1.759946 0.8770912 1.364519
13 1.765846 0.8764817 1.371836
15 1.803247 0.8716558 1.393796
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 9 and degree = 1.
print(marsModel$bestTune)
nprune degree
6 9 1
Method | RMSE | Params |
---|---|---|
earth | 1.753020 | nprune=9 , degree=1 |
Neural Network
nnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProc = c("center", "scale"),
  trace = FALSE,
  tuneLength = 7
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
print(nnModel)
Neural Network
200 samples
10 predictor
Pre-processing: centered (10), scaled (10)
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results across tuning parameters:
size decay RMSE Rsquared MAE
1 0.0000000000 14.33889 NaN 13.48270
1 0.0001000000 14.33889 0.4876696 13.48270
1 0.0003981072 14.33889 0.5207444 13.48270
1 0.0015848932 14.33889 0.1758176 13.48271
1 0.0063095734 14.33890 0.5208661 13.48272
1 0.0251188643 14.33894 0.7160343 13.48275
1 0.1000000000 14.33906 0.7340019 13.48289
3 0.0000000000 14.33889 NaN 13.48270
3 0.0001000000 14.33889 0.2363548 13.48270
3 0.0003981072 14.33889 0.4152135 13.48270
3 0.0015848932 14.33889 0.1537913 13.48270
3 0.0063095734 14.33890 0.6250552 13.48271
3 0.0251188643 14.33892 0.7306075 13.48274
3 0.1000000000 14.33900 0.7343296 13.48283
5 0.0000000000 14.33889 NaN 13.48270
5 0.0001000000 14.33889 0.2299694 13.48270
5 0.0003981072 14.33889 0.3656188 13.48270
5 0.0015848932 14.33889 0.1526061 13.48270
5 0.0063095734 14.33890 0.5888706 13.48271
5 0.0251188643 14.33891 0.7280735 13.48273
5 0.1000000000 14.33898 0.7350663 13.48280
7 0.0000000000 14.33889 NaN 13.48270
7 0.0001000000 14.33889 0.2439086 13.48270
7 0.0003981072 14.33889 0.3016703 13.48270
7 0.0015848932 14.33889 0.1726897 13.48270
7 0.0063095734 14.33890 0.5889107 13.48271
7 0.0251188643 14.33891 0.7248153 13.48273
7 0.1000000000 14.33897 0.7346983 13.48279
9 0.0000000000 14.33889 NaN 13.48270
9 0.0001000000 14.33889 0.2341736 13.48270
9 0.0003981072 14.33889 0.1673498 13.48270
9 0.0015848932 14.33889 0.1135118 13.48270
9 0.0063095734 14.33889 0.5660738 13.48271
9 0.0251188643 14.33891 0.7199275 13.48272
9 0.1000000000 14.33896 0.7347103 13.48278
11 0.0000000000 14.33889 NaN 13.48270
11 0.0001000000 14.33889 0.1908908 13.48270
11 0.0003981072 14.33889 0.1669535 13.48270
11 0.0015848932 14.33889 0.1335503 13.48270
11 0.0063095734 14.33889 0.5409709 13.48271
11 0.0251188643 14.33891 0.7146450 13.48272
11 0.1000000000 14.33895 0.7350194 13.48277
13 0.0000000000 14.33889 NaN 13.48270
13 0.0001000000 14.33889 0.1539080 13.48270
13 0.0003981072 14.33889 0.1471895 13.48270
13 0.0015848932 14.33889 0.1259564 13.48270
13 0.0063095734 14.33889 0.6228874 13.48271
13 0.0251188643 14.33891 0.7124549 13.48272
13 0.1000000000 14.33895 0.7352840 13.48277
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 1 and decay = 0.
print(nnModel$bestTune)
size decay
1 1 0
That is surprising. For no good reason, I thought a neural network would be competitive, but it’s not even close: the RMSE is essentially flat at ~14.34 across every size and decay, which suggests the network is predicting a near-constant value.
Method | RMSE | Params |
---|---|---|
nnet | 14.33889 | size=1 , decay=0 |
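A likely culprit (an assumption on my part, not something verified here): caret passes extra arguments through to nnet::nnet, whose output unit is logistic by default and therefore squashes predictions into [0, 1], far below where this response lives. Setting linout = TRUE gives a linear output unit for regression. A sketch:
# Sketch: retrain with a linear output unit. linout = TRUE is nnet's
# regression setting; maxit = 500 is an assumed, more generous iteration cap.
nnLinoutModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "nnet",
  preProc = c("center", "scale"),
  trace = FALSE,
  linout = TRUE,
  maxit = 500,
  tuneLength = 7
)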
Answers
Which models appear to give the best performance?
Putting them all together:
Model | Method | RMSE | Params |
---|---|---|---|
KNN | knn | 3.183130 | k=17 |
SVR | svmLinear | 2.551519 | C=1 |
SVR | svmRadial | 2.046385 | C=16 , sigma=0.06529705 |
SVR | svmPoly | 2.155248 | C=0.25 , degree=3 , scale=0.1 |
MARS | earth | 1.753020 | nprune=9 , degree=1 |
NNet | nnet | 14.33889 | size=1 , decay=0 |
The best performance appears to be from the MARS model, with a resampled RMSE of 1.753020.
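Since we also simulated a large test set, a quick cross-check of the tuned models on it would look something like this (a sketch, not run here, using the same caret predict/postResample tools as above):
# Sketch: test-set RMSE for each tuned 7.2 model.
sapply(
  list(knn = knnModel, svmRadial = svmRModel, svmPoly = svmPModel, mars = marsModel),
  function(m) postResample(pred = predict(m, newdata = testData$x),
                           obs = testData$y)["RMSE"]
)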
Does MARS select the informative predictors (those named \(X1–X5\))?
varImp(marsModel)
earth variable importance
Overall
X1 100.00
X4 82.92
X2 64.47
X5 40.67
X3 28.65
X6 0.00
MARS most certainly did, and exclusively: only \(X1\)–\(X5\) receive nonzero importance (X6 scores 0.00, and the remaining noise variables don’t appear at all).
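caret can confirm this directly: predictors() lists the variables the final earth model actually uses (a quick check, not run here):
predictors(marsModel)  # expected to return only informative predictors (X1-X5)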
Cleanup
Let’s get rid of some of the clutter.
rm(trainingData, testData, knnModel, svmLModel, svmRModel, svmPModel, marsModel, nnModel)
Problem 7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
data("ChemicalManufacturingProcess")
# Impute missing values with k-nearest neighbors
# (note: preProcess() with "knnImpute" also centers and scales every
# column, including Yield, so downstream RMSEs are on the standardized scale)
pp <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(pp, ChemicalManufacturingProcess)

# get rid of near zero variance predictors
nz <- nearZeroVar(df, saveMetrics = FALSE)
df <- df[, -nz]

# Split the data into training and test sets
x <- df[, -1]
y <- df$Yield
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
train_x <- x[train_index, ]
train_y <- y[train_index]
test_x <- x[-train_index, ]
test_y <- y[-train_index]

# Cleanup
rm(pp, df, x, y, train_index)
Now let us set up resampling. We will use cross-validation to evaluate the models.
ctrl <- trainControl(
  method = "cv",
  number = 5,
  returnResamp = "all"
)
Fitting
KNN
knnModel <- train(
  x = train_x,
  y = train_y,
  method = "knn",
  preProc = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 10
)
print(knnModel)
k-Nearest Neighbors
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 99, 99, 99, 100, 99
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 0.6895884 0.5118210 0.5630125
7 0.7213101 0.4668344 0.5880421
9 0.7148948 0.4940850 0.5865825
11 0.7101750 0.5071768 0.5774092
13 0.7106957 0.5150371 0.5768799
15 0.7159053 0.5206550 0.5750976
17 0.7243754 0.5232246 0.5889910
19 0.7374226 0.5155476 0.5946109
21 0.7406342 0.5199245 0.5947447
23 0.7512223 0.5097597 0.6024067
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
print(knnModel$bestTune)
k
1 5
SVR
trainSVR <- function(method, tuneLength = 5) {
  fit <- train(
    x = train_x,
    y = train_y,
    method = method,
    preProcess = c("center", "scale"),
    trControl = ctrl,
    tuneLength = tuneLength
  )
  print(fit)
  print(fit$bestTune)
  return(fit)
}

svmLModel <- trainSVR("svmLinear", 1)
Support Vector Machines with Linear Kernel
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 97, 100, 100, 100, 99
Resampling results:
RMSE Rsquared MAE
2.853927 0.3292755 1.127083
Tuning parameter 'C' was held constant at a value of 1
C
1 1
<- trainSVR("svmRadial", 10) svmRModel
Support Vector Machines with Radial Basis Function Kernel
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 99, 99, 99, 100, 99
Resampling results across tuning parameters:
C RMSE Rsquared MAE
0.25 0.7738469 0.4669364 0.6315921
0.50 0.7208345 0.4859653 0.5831042
1.00 0.6890652 0.5079438 0.5547469
2.00 0.6635964 0.5312038 0.5399453
4.00 0.6443627 0.5543532 0.5316313
8.00 0.6442760 0.5545981 0.5352342
16.00 0.6459216 0.5530072 0.5382635
32.00 0.6459216 0.5530072 0.5382635
64.00 0.6459216 0.5530072 0.5382635
128.00 0.6459216 0.5530072 0.5382635
Tuning parameter 'sigma' was held constant at a value of 0.01236299
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 0.01236299 and C = 8.
sigma C
6 0.01236299 8
<- trainSVR("svmPoly", 4) svmPModel
Support Vector Machines with Polynomial Kernel
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 98, 99, 100, 100, 99
Resampling results across tuning parameters:
degree scale C RMSE Rsquared MAE
1 0.001 0.25 0.9106419 0.3849392 0.7281315
1 0.001 0.50 0.8616691 0.3975147 0.6980332
1 0.001 1.00 0.8082105 0.4136057 0.6662630
1 0.001 2.00 0.7611620 0.4199080 0.6313126
1 0.010 0.25 0.7643470 0.4090222 0.6259356
1 0.010 0.50 0.7867462 0.4144638 0.6072599
1 0.010 1.00 0.9152588 0.4023962 0.6389675
1 0.010 2.00 1.1851424 0.3833684 0.7008715
1 0.100 0.25 1.3083413 0.3745627 0.7303115
1 0.100 0.50 1.6255470 0.3617986 0.7934839
1 0.100 1.00 2.1657717 0.3261827 0.9086161
1 0.100 2.00 2.6400839 0.2696350 1.0311687
1 1.000 0.25 2.6536636 0.2623733 1.0482108
1 1.000 0.50 2.8187630 0.2456215 1.1061679
1 1.000 1.00 3.0821454 0.2390803 1.1683853
1 1.000 2.00 3.5834123 0.2470449 1.2751313
2 0.001 0.25 0.8612681 0.4069234 0.6967042
2 0.001 0.50 0.8071078 0.4165077 0.6658916
2 0.001 1.00 0.7570117 0.4367414 0.6246751
2 0.001 2.00 0.7587181 0.4434109 0.5985194
2 0.010 0.25 1.0514732 0.3453416 0.6570466
2 0.010 0.50 1.1891133 0.3540641 0.6707408
2 0.010 1.00 1.7145890 0.2902642 0.7891574
2 0.010 2.00 1.9345630 0.2801307 0.8330883
2 0.100 0.25 4.9842790 0.2587619 1.4734784
2 0.100 0.50 5.5729776 0.2540084 1.5928851
2 0.100 1.00 5.5729776 0.2540084 1.5928851
2 0.100 2.00 5.5729776 0.2540084 1.5928851
2 1.000 0.25 6.6374691 0.1881592 1.8621289
2 1.000 0.50 6.6374691 0.1881592 1.8621289
2 1.000 1.00 6.6374691 0.1881592 1.8621289
2 1.000 2.00 6.6374691 0.1881592 1.8621289
3 0.001 0.25 0.8354211 0.3633199 0.6879510
3 0.001 0.50 0.7714611 0.4341537 0.6400390
3 0.001 1.00 0.7455086 0.4518126 0.6028351
3 0.001 2.00 0.8309080 0.3847034 0.6142516
3 0.010 0.25 15.4474754 0.2870653 3.5569843
3 0.010 0.50 21.0469213 0.2757707 4.6848753
3 0.010 1.00 24.2540434 0.2586997 5.3343507
3 0.010 2.00 31.8667519 0.2595499 6.8544389
3 0.100 0.25 119.8764181 0.2484157 24.7404900
3 0.100 0.50 119.8764181 0.2484157 24.7404900
3 0.100 1.00 119.8764181 0.2484157 24.7404900
3 0.100 2.00 119.8764181 0.2484157 24.7404900
3 1.000 0.25 49.9475146 0.2073258 10.8534458
3 1.000 0.50 49.9475146 0.2073258 10.8534458
3 1.000 1.00 49.9475146 0.2073258 10.8534458
3 1.000 2.00 49.9475146 0.2073258 10.8534458
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were degree = 3, scale = 0.001 and C = 1.
degree scale C
35 3 0.001 1
MARS
marsModel <- train(
  x = train_x,
  y = train_y,
  method = "earth",
  preProc = c("center", "scale"),
  trControl = ctrl,
  tuneLength = 10
)
print(marsModel)
Multivariate Adaptive Regression Spline
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 100, 99, 98, 99, 100
Resampling results across tuning parameters:
nprune RMSE Rsquared MAE
2 0.7271135 0.4591539 0.5780916
3 0.7182703 0.4896215 0.5709093
5 0.6681445 0.5419800 0.5213217
7 2.9453922 0.3857181 1.0154052
8 3.1478889 0.4316809 1.0283935
10 3.1797305 0.3887363 1.0624880
12 3.1466059 0.4083874 1.0460210
13 3.1616509 0.3970114 1.0587304
15 3.1609251 0.3997024 1.0593647
17 3.1609251 0.3997024 1.0593647
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 5 and degree = 1.
print(marsModel$bestFit)  # typo: the slot is `bestTune`, hence the NULL below
NULL
Neural Network
nnModel <- train(
  x = train_x,
  y = train_y,
  method = "nnet",
  preProc = c("center", "scale"),
  trace = FALSE,
  tuneLength = 7,
  trControl = ctrl
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
print(nnModel)
Neural Network
124 samples
56 predictor
Pre-processing: centered (56), scaled (56)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 99, 99, 100, 99, 99
Resampling results across tuning parameters:
size decay RMSE Rsquared MAE
1 0.0000000000 0.9682669 NaN 0.7873824
1 0.0001000000 0.9150490 0.3653951 0.7207956
1 0.0003981072 0.8755967 0.3371277 0.6952522
1 0.0015848932 0.8334053 0.4289406 0.6628271
1 0.0063095734 0.8114338 0.4595784 0.6414999
1 0.0251188643 0.8362465 0.4254025 0.6710512
1 0.1000000000 0.8301152 0.4469966 0.6622558
3 0.0000000000 0.9682424 0.3804808 0.7873682
3 0.0001000000 0.8443653 0.4364528 0.6761996
3 0.0003981072 0.7899455 0.4966880 0.6254486
3 0.0015848932 0.8517819 0.3514565 0.6790311
3 0.0063095734 0.8252297 0.4245470 0.6520447
3 0.0251188643 0.8232849 0.4635433 0.6581931
3 0.1000000000 0.8244168 0.4502662 0.6552868
5 0.0000000000 0.9682669 NaN 0.7873824
5 0.0001000000 0.8378285 0.4562766 0.6724116
5 0.0003981072 0.8086785 0.4689954 0.6450542
5 0.0015848932 0.8299925 0.4234782 0.6608541
5 0.0063095734 0.8222530 0.4407092 0.6544801
5 0.0251188643 0.8181780 0.4663482 0.6572138
5 0.1000000000 0.8199729 0.4571137 0.6518002
7 0.0000000000 0.9682668 0.1120367 0.7873824
7 0.0001000000 0.8505934 0.4688188 0.6708782
7 0.0003981072 0.8244911 0.4023881 0.6477988
7 0.0015848932 0.8198216 0.4436599 0.6563836
7 0.0063095734 0.8133107 0.4686115 0.6576149
7 0.0251188643 0.8095148 0.4972942 0.6428775
7 0.1000000000 0.8221814 0.4519543 0.6525025
9 0.0000000000 0.9682669 NaN 0.7873824
9 0.0001000000 0.8336860 0.3973938 0.6599252
9 0.0003981072 0.8026328 0.4572917 0.6366140
9 0.0015848932 0.8170857 0.4452725 0.6481996
9 0.0063095734 0.8176285 0.4530313 0.6521788
9 0.0251188643 0.8124691 0.4781019 0.6476017
9 0.1000000000 0.8225707 0.4466223 0.6518940
11 0.0000000000 0.9682669 NaN 0.7873824
11 0.0001000000 0.8066660 0.4484055 0.6355455
11 0.0003981072 0.8365208 0.3695474 0.6724226
11 0.0015848932 0.8226359 0.4259208 0.6531413
11 0.0063095734 0.8183817 0.4594450 0.6534696
11 0.0251188643 0.8097759 0.4915198 0.6540279
11 0.1000000000 0.8231418 0.4430660 0.6524492
13 0.0000000000 0.9682669 NaN 0.7873824
13 0.0001000000 0.8218718 0.4043562 0.6512450
13 0.0003981072 0.8286873 0.4045652 0.6417256
13 0.0015848932 0.8199272 0.4439009 0.6500867
13 0.0063095734 0.8269164 0.4184251 0.6533317
13 0.0251188643 0.8245829 0.4269047 0.6526659
13 0.1000000000 0.8217071 0.4485951 0.6529192
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 3 and decay = 0.0003981072.
print(nnModel$bestTune)
size decay
10 3 0.0003981072
7.5.a
Which nonlinear regression model gives the optimal resampling and test set performance?
Let’s automate this and pluck the optimal resampling model out of our set of models.
models <- list(
  KNN = knnModel,
  SVMLinear = svmLModel,
  SVMRadial = svmRModel,
  SVMPoly = svmPModel,
  MARS = marsModel,
  NNEt = nnModel
)

results <- tibble(Model = names(models), ParamNames = NA, ParamValues = NA, RMSE = NA, RSquared = NA)

for (i in seq(1, length(models))) {
  model <- models[[i]]
  results$RMSE[i] <- min(model$results$RMSE)
  # Hmm, I'm presuming that the lowest RMSE is also the best fit...
  # Note: for models with several tuning parameters, only the first
  # parameter's name and value land in the table.
  results$ParamNames[i] = names(model$bestTune)[1]
  results$ParamValues[i] = as_vector(model$bestTune)[1]
}

results <- arrange(results, RMSE)
print(results)
# A tibble: 6 × 5
Model ParamNames ParamValues RMSE RSquared
<chr> <chr> <dbl> <dbl> <lgl>
1 SVMRadial sigma 0.0124 0.644 NA
2 MARS nprune 5 0.668 NA
3 KNN k 5 0.690 NA
4 SVMPoly degree 3 0.746 NA
5 NNEt size 3 0.790 NA
6 SVMLinear C 1 2.85 NA
Wow, it was a close race among the top finishers. Our champion, SVM with a radial kernel (\(\sigma = 0.0124\)), weighs in with an RMSE of \(\approx 0.6443\) (on the standardized Yield scale, courtesy of the knnImpute pre-processing).
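The question also asks about test-set performance, which we never computed. A check on the held-out 30% split would look something like this (a sketch, not run here):
# Sketch: test-set RMSE for every tuned 7.5 model.
testRMSE <- sapply(models, function(m) {
  postResample(pred = predict(m, newdata = test_x), obs = test_y)["RMSE"]
})
sort(testRMSE)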
7.5.b
Which predictors are most important in the optimal nonlinear regression model?
varImp(svmRModel)
loess r-squared variable importance
only 20 most important variables shown (out of 56)
Overall
ManufacturingProcess32 100.00
BiologicalMaterial06 80.33
BiologicalMaterial02 72.54
ManufacturingProcess13 70.64
ManufacturingProcess36 60.52
BiologicalMaterial03 60.25
ManufacturingProcess31 58.55
ManufacturingProcess09 54.30
BiologicalMaterial04 53.08
BiologicalMaterial12 51.95
ManufacturingProcess17 51.34
ManufacturingProcess02 50.26
ManufacturingProcess29 49.66
BiologicalMaterial01 45.83
ManufacturingProcess33 43.46
BiologicalMaterial08 35.54
BiologicalMaterial11 34.29
ManufacturingProcess06 33.04
ManufacturingProcess04 31.17
ManufacturingProcess11 30.28
As you can see, the most important predictor is ManufacturingProcess32, but ManufacturingProcess isn’t dominating the way it did with linear regression: of the first 10 most important predictors, 5 are BiologicalMaterial.
Do either the biological or process variables dominate the list?
Welly, welly, I jumped the gun. Let’s consider the entire list this time. Of the first 20, 8 are BiologicalMaterial and 12 are ManufacturingProcess, so ManufacturingProcess still tops the list. But considering that there are far fewer BiologicalMaterial predictors (12 in the raw data) than ManufacturingProcess predictors (45), one could say that BiologicalMaterial is relatively dominant:
\[ \frac{8}{12} \gt \frac{12}{45} \]
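A quick tally makes the same point programmatically (a dplyr sketch, not run here):
# Sketch: count predictor types among the 20 most important variables.
varImp(svmRModel)$importance |>
  rownames_to_column("Predictor") |>
  arrange(desc(Overall)) |>
  slice_head(n = 20) |>
  count(Type = ifelse(startsWith(Predictor, "Biological"), "Biological", "Process"))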
And if we compare each group’s total importance within the top 10?
manufacturing = sum(100, 70.63862, 60.51993, 58.54715, 54.30407)
biological = sum(80.33204, 72.54240, 60.25020, 53.08210, 51.94877)
manufacturing
[1] 344.0098
biological
[1] 318.1555
Manufacturing comes out on top. In conclusion, I wouldn’t say that either group dominates, but ManufacturingProcess does have a greater influence on the radial SVM’s ranking of predictor importance than BiologicalMaterial does.
How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
BiologicalMaterial has come a long way. If we look back at linear regression’s rankings, we see the first 20 ranked as follows:
Predictor | Importance | Rank |
---|---|---|
ManufacturingProcess32 | 100.0000000 | 1 |
ManufacturingProcess04 | 84.2112599 | 2 |
ManufacturingProcess28 | 80.4631816 | 3 |
ManufacturingProcess43 | 64.8538418 | 4 |
ManufacturingProcess33 | 62.4884204 | 5 |
ManufacturingProcess12 | 57.8543025 | 6 |
ManufacturingProcess35 | 38.2649373 | 7 |
ManufacturingProcess37 | 38.0256614 | 8 |
BiologicalMaterial11 | 37.2042557 | 9 |
ManufacturingProcess10 | 36.5106556 | 10 |
ManufacturingProcess18 | 34.2042194 | 11 |
ManufacturingProcess20 | 33.1341354 | 12 |
ManufacturingProcess29 | 32.2528714 | 13 |
BiologicalMaterial02 | 31.0298498 | 14 |
ManufacturingProcess05 | 30.7444067 | 15 |
ManufacturingProcess26 | 30.1456411 | 16 |
ManufacturingProcess13 | 29.5391977 | 17 |
ManufacturingProcess25 | 28.3191327 | 18 |
ManufacturingProcess11 | 27.9038396 | 19 |
ManufacturingProcess24 | 27.6671361 | 20 |
Vs. non-linear regression (SVM)
varImp(svmRModel)$importance |>
arrange(desc(Overall)) |>
mutate(Index = row_number()) |>
head(n = 20)
Overall Index
ManufacturingProcess32 100.00000 1
BiologicalMaterial06 80.33204 2
BiologicalMaterial02 72.54240 3
ManufacturingProcess13 70.63862 4
ManufacturingProcess36 60.51993 5
BiologicalMaterial03 60.25020 6
ManufacturingProcess31 58.54715 7
ManufacturingProcess09 54.30407 8
BiologicalMaterial04 53.08210 9
BiologicalMaterial12 51.94877 10
ManufacturingProcess17 51.33595 11
ManufacturingProcess02 50.26472 12
ManufacturingProcess29 49.66376 13
BiologicalMaterial01 45.83156 14
ManufacturingProcess33 43.46117 15
BiologicalMaterial08 35.53841 16
BiologicalMaterial11 34.28960 17
ManufacturingProcess06 33.03744 18
ManufacturingProcess04 31.16695 19
ManufacturingProcess11 30.27655 20
We see a much greater contribution of BiologicalMaterial in non-linear regression than we did with linear regression. There are only two BiologicalMaterial predictors in the first 20 variables of importance for linear regression.
7.5.c
Explore relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Let’s isolate the top 10 predictors
df <- varImp(svmRModel)$importance |>
  arrange(desc(Overall)) |>
  slice_head(n = 10)
df <- ChemicalManufacturingProcess[rownames(df)]
# Let's shorten the predictor names. They are obscuring data in our correlation plot.
names(df) <- c("MP32", "BM06", "BM02", "MP13", "MP36", "BM03", "MP31", "MP09", "BM04", "BM12")
# Add yield as `Y`
df$Y <- ChemicalManufacturingProcess$Yield
Now let’s have a look at our data using GGally. We’ll use a heat-map to find the correlation between all pairs in our set of variables.
GGally::ggcorr(df, label = TRUE)
Interesting! The predictor with the highest correlation with Yield is also ranked the most important (MP32). And the second (BM06), third (BM02), fourth (MP13) and fifth (MP36) all have absolute correlations with Yield that round to \(0.5\). This is a good sign.
Also, there is a lot of correlation between some of our predictors. Let’s identify the top 5.
cor(df$BM02, df$BM06)
[1] 0.9543113
cor(df$BM02, df$BM03)
[1] 0.8607901
cor(df$BM06, df$BM03)
[1] 0.8723637
cor(df$BM06, df$BM12)
[1] 0.812854
cor(df$BM02, df$BM12)
[1] 0.7793419
We very likely could have used PCA to reduce the dimensionality of our data. But we are doing the pre-processing during post-processing.
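For the record, here is a sketch of what that might look like on the four most correlated biological predictors (these columns happen to be complete, per the NA counts below, so prcomp runs without imputation):
# Sketch: PCA on the strongly correlated biological predictors.
pca <- prcomp(df[, c("BM02", "BM03", "BM06", "BM12")], center = TRUE, scale. = TRUE)
summary(pca)  # expect PC1 to carry most of the shared variance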
Before we wrap it up, I’d like to look at some of these relationships directly. We tried to plot them all in one panel, but it was too crowded, so we faceted the data and normalized each variable to make the relationships easier to see.
df |>
  mutate(
    Index = row_number(),
    MP32 = (MP32 - min(MP32)) / (max(MP32) - min(MP32)),
    BM06 = (BM06 - min(BM06)) / (max(BM06) - min(BM06)),
    BM02 = (BM02 - min(BM02)) / (max(BM02) - min(BM02)),
    MP13 = (MP13 - min(MP13)) / (max(MP13) - min(MP13)),
    MP36 = (MP36 - min(MP36)) / (max(MP36) - min(MP36)),
    BM03 = (BM03 - min(BM03)) / (max(BM03) - min(BM03)),
    MP31 = (MP31 - min(MP31)) / (max(MP31) - min(MP31)),
    MP09 = (MP09 - min(MP09)) / (max(MP09) - min(MP09)),
    BM04 = (BM04 - min(BM04)) / (max(BM04) - min(BM04)),
    BM12 = (BM12 - min(BM12)) / (max(BM12) - min(BM12)),
    Y = (Y - min(Y)) / (max(Y) - min(Y))
  ) |>
  pivot_longer(cols = -Index, names_to = "Variable", values_to = "Value") |>
  ggplot(mapping = aes(x = Index, y = Value, color = Variable)) +
  geom_line() +
  labs(title = "Top 10 Predictors and Yield") +
  facet_wrap(~ Variable, ncol = 2)
Warning: Removed 352 rows containing missing values or values outside the scale range
(`geom_line()`).
Where are those missing values coming from!? Oh geez, I pulled the top 10 predictors from ChemicalManufacturingProcess, the raw data, rather than from the imputed copy produced during pre-processing. This was a mistake, but at the same time a little fortuitous: our correlation calculations, normalization and plotting are based on observed rather than imputed data.
Let’s count the missing values in df.
df |>
  summarise(across(everything(), ~ sum(is.na(.))))
MP32 BM06 BM02 MP13 MP36 BM03 MP31 MP09 BM04 BM12 Y
1 0 0 0 0 5 0 5 0 0 0 0
And that is what our plot is showing us. We can live with that, especially since we need to turn it in :).
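One last tidying note: the block of repeated min–max lines above could be collapsed with across(); this sketch (rescale01 is a hypothetical helper) also tolerates the NAs we just found:
# Sketch: min-max scale every column except Index, ignoring NAs.
rescale01 <- function(v) {
  (v - min(v, na.rm = TRUE)) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
}
df |>
  mutate(Index = row_number()) |>
  mutate(across(-Index, rescale01))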