7.2 and 7.5
7.2. Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$
where the x values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
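As a rough illustration of the equation above, the simulation can be sketched in base R. The helper `simulate_friedman1` is a hypothetical name, not part of mlbench, and it is not expected to reproduce the exact values that `mlbench.friedman1` returns for the same seed; it only shows how the five informative predictors enter the response.

```r
# Hypothetical helper, not part of mlbench: simulate the Friedman equation by hand.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten independent Uniform(0, 1) predictors; only X1-X5 enter the equation.
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with standard deviation 'sd' to get the response.
  list(x = as.data.frame(x), y = signal + rnorm(n, sd = sd), signal = signal)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
range(sim$signal)  # the noise-free part always lies within [0, 30]
```

Because X6 through X10 never appear in `signal`, a model that selects them is fitting noise.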
Tune several models on these data. For example:
knnModel <- train(x = trainingData$x,
y = trainingData$y, method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
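To make the tuned quantity concrete, here is a minimal base R sketch of k-nearest-neighbors regression with centering and scaling. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; this is not caret's implementation.

```r
# Hypothetical helper: predict by averaging the outcomes of the k closest
# training points, after centering and scaling with training-set statistics.
knn_predict <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  mu <- colMeans(train_x)
  sdev <- apply(train_x, 2, sd)
  z_train <- scale(train_x, center = mu, scale = sdev)
  z_new <- scale(new_x, center = mu, scale = sdev)
  apply(z_new, 1, function(row) {
    # Euclidean distance from this new point to every training point
    d <- sqrt(colSums((t(z_train) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data (an assumption for illustration only)
set.seed(1)
toy_x <- matrix(runif(40), ncol = 2)
toy_y <- toy_x[, 1] + toy_x[, 2]
knn_predict(toy_x, toy_y, toy_x[1:3, , drop = FALSE], k = 5)
```

Small k follows the training data closely, while large k averages over more neighbors and smooths the prediction, which is the trade-off the resampling table explores.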
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
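The three values reported by postResample can be approximated by hand. The sketch below assumes, as I understand caret's default, that Rsquared is the squared correlation between predictions and observations; `manual_post_resample` is a hypothetical name and the vectors are toy values.

```r
# Hypothetical helper mirroring the three metrics reported above.
manual_post_resample <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,         # squared correlation (assumed default)
    MAE = mean(abs(pred - obs)))         # mean absolute error
}

toy_obs <- c(1, 2, 3, 5)
toy_pred <- c(1, 2, 3, 4)
manual_post_resample(toy_pred, toy_obs)
```

RMSE and MAE are in the units of the outcome, so they are only comparable across models fit to the same response.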
library(earth)
marsFit <- earth(trainingData$x, trainingData$y)
marsFit
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
summary(marsFit)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
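The h() terms in the summary are hinge functions, h(z) = max(0, z). As a small sketch, the two X4 terms from the output above can be evaluated by hand; the coefficients and knot are copied from the summary, while the X4 values are arbitrary illustration points.

```r
# Hinge function used in MARS basis terms
h <- function(z) pmax(0, z)

# Arbitrary X4 values below, at, and above the knot 0.734892
x4 <- c(0.2, 0.734892, 0.9)

# Contribution of the two X4 terms to the prediction
x4_contribution <- -10.027487 * h(0.734892 - x4) + 9.092045 * h(x4 - 0.734892)
x4_contribution
```

Each pair of mirrored hinges gives a piecewise-linear effect with a change of slope at the knot, which is how the additive model approximates the nonlinear terms in the simulation.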
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# Explicitly declare the candidate models to test
marsTuned <- train(trainingData$x, trainingData$y,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.327937 0.2544880 3.6004742
## 1 3 3.572450 0.4912720 2.8958113
## 1 4 2.596841 0.7183600 2.1063410
## 1 5 2.370161 0.7659777 1.9186686
## 1 6 2.276141 0.7881481 1.8100006
## 1 7 1.766728 0.8751831 1.3902146
## 1 8 1.780946 0.8723243 1.4013449
## 1 9 1.665091 0.8819775 1.3255147
## 1 10 1.663804 0.8821283 1.3276573
## 1 11 1.657738 0.8822967 1.3317299
## 1 12 1.653784 0.8827903 1.3315041
## 1 13 1.648496 0.8823663 1.3164065
## 1 14 1.639073 0.8841742 1.3128329
## 1 15 1.639073 0.8841742 1.3128329
## 1 16 1.639073 0.8841742 1.3128329
## 1 17 1.639073 0.8841742 1.3128329
## 1 18 1.639073 0.8841742 1.3128329
## 1 19 1.639073 0.8841742 1.3128329
## 1 20 1.639073 0.8841742 1.3128329
## 1 21 1.639073 0.8841742 1.3128329
## 1 22 1.639073 0.8841742 1.3128329
## 1 23 1.639073 0.8841742 1.3128329
## 1 24 1.639073 0.8841742 1.3128329
## 1 25 1.639073 0.8841742 1.3128329
## 1 26 1.639073 0.8841742 1.3128329
## 1 27 1.639073 0.8841742 1.3128329
## 1 28 1.639073 0.8841742 1.3128329
## 1 29 1.639073 0.8841742 1.3128329
## 1 30 1.639073 0.8841742 1.3128329
## 1 31 1.639073 0.8841742 1.3128329
## 1 32 1.639073 0.8841742 1.3128329
## 1 33 1.639073 0.8841742 1.3128329
## 1 34 1.639073 0.8841742 1.3128329
## 1 35 1.639073 0.8841742 1.3128329
## 1 36 1.639073 0.8841742 1.3128329
## 1 37 1.639073 0.8841742 1.3128329
## 1 38 1.639073 0.8841742 1.3128329
## 2 2 4.327937 0.2544880 3.6004742
## 2 3 3.572450 0.4912720 2.8958113
## 2 4 2.661826 0.7070510 2.1734709
## 2 5 2.404015 0.7578971 1.9753867
## 2 6 2.243927 0.7914805 1.7830717
## 2 7 1.856336 0.8605482 1.4356822
## 2 8 1.754607 0.8763186 1.3968406
## 2 9 1.653859 0.8870129 1.2813884
## 2 10 1.434159 0.9166537 1.1339203
## 2 11 1.320482 0.9289120 1.0347278
## 2 12 1.317547 0.9306879 1.0359899
## 2 13 1.296910 0.9306902 1.0146112
## 2 14 1.221407 0.9395223 0.9631486
## 2 15 1.230516 0.9390469 0.9761484
## 2 16 1.236911 0.9387407 0.9745362
## 2 17 1.236911 0.9387407 0.9745362
## 2 18 1.236911 0.9387407 0.9745362
## 2 19 1.236911 0.9387407 0.9745362
## 2 20 1.236911 0.9387407 0.9745362
## 2 21 1.236911 0.9387407 0.9745362
## 2 22 1.236911 0.9387407 0.9745362
## 2 23 1.236911 0.9387407 0.9745362
## 2 24 1.236911 0.9387407 0.9745362
## 2 25 1.236911 0.9387407 0.9745362
## 2 26 1.236911 0.9387407 0.9745362
## 2 27 1.236911 0.9387407 0.9745362
## 2 28 1.236911 0.9387407 0.9745362
## 2 29 1.236911 0.9387407 0.9745362
## 2 30 1.236911 0.9387407 0.9745362
## 2 31 1.236911 0.9387407 0.9745362
## 2 32 1.236911 0.9387407 0.9745362
## 2 33 1.236911 0.9387407 0.9745362
## 2 34 1.236911 0.9387407 0.9745362
## 2 35 1.236911 0.9387407 0.9745362
## 2 36 1.236911 0.9387407 0.9745362
## 2 37 1.236911 0.9387407 0.9745362
## 2 38 1.236911 0.9387407 0.9745362
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
head(predict(marsTuned, trainingData$x))
## y
## [1,] 18.03158
## [2,] 15.61875
## [3,] 17.74888
## [4,] 12.26653
## [5,] 19.74822
## [6,] 20.60847
#checking the most important variables
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
MarsPred <- predict(marsTuned, testData$x)
postResample(pred = MarsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
The test-set R-squared for MARS is about 0.934 with an RMSE of 1.28, so it explains most of the variance.
library(kernlab)
set.seed(1122)
svmRTuned <- train(trainingData$x, trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.464385 0.8140734 1.999374
## 0.50 2.215541 0.8277944 1.792697
## 1.00 2.030869 0.8484840 1.638046
## 2.00 1.887732 0.8655903 1.492305
## 4.00 1.762290 0.8793559 1.396959
## 8.00 1.706200 0.8847200 1.362491
## 16.00 1.706837 0.8857520 1.376093
## 32.00 1.706601 0.8858170 1.376046
## 64.00 1.706601 0.8858170 1.376046
## 128.00 1.706601 0.8858170 1.376046
## 256.00 1.706601 0.8858170 1.376046
## 512.00 1.706601 0.8858170 1.376046
## 1024.00 1.706601 0.8858170 1.376046
## 2048.00 1.706601 0.8858170 1.376046
##
## Tuning parameter 'sigma' was held constant at a value of 0.06096343
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06096343 and C = 8.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 8
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0609634348804832
##
## Number of Support Vectors : 154
##
## Objective Function Value : -75.871
## Training error : 0.00936
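For intuition about the sigma value reported above, here is a minimal sketch of the Gaussian radial basis kernel in the parameterization kernlab documents, k(a, b) = exp(-sigma * ||a - b||^2). The function `rbf_kernel` and the two points are hypothetical illustrations.

```r
# Hypothetical helper: Gaussian RBF kernel, exp(-sigma * squared distance)
rbf_kernel <- function(a, b, sigma) exp(-sigma * sum((a - b)^2))

sigma_hat <- 0.06096343   # value held constant in the tuning output above
a <- rep(0, 10)           # two arbitrary points in 10 dimensions
b <- rep(1, 10)

rbf_kernel(a, a, sigma_hat)  # identical points have similarity 1
rbf_kernel(a, b, sigma_hat)  # similarity decays with squared distance
```

A smaller sigma makes the kernel decay more slowly, so more distant training points influence each prediction.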
head(predict(svmRTuned, testData$x))
## [1] 19.210924 21.859312 13.012286 7.552461 12.553335 14.018247
svmPred <- predict(svmRTuned, testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0462947 0.8302912 1.5521846
The SVM's test RMSE of 2.05 is higher than MARS's 1.28, and its R-squared is lower at 0.83. It still explains much of the variance, and more than KNN (0.68).
The predictors in trainingData$x appear to be on a similar scale (all between 0 and about 0.99), so I will use the nnet package.
I went with the parameters from the book, including 5 hidden units.
library(nnet)
set.seed(100)
nnetFit <- nnet(trainingData$x, trainingData$y, size = 5,
decay = .01,
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnPred <- predict(nnetFit, testData$x)
postResample(pred = nnPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.7850964 0.7033146 2.1812067
The RMSE is 2.79 and the R-squared is 0.70. Let’s try the caret package, which offers more help with tuning.
nnetGrid <- expand.grid(.size = c(1, 3, 5, 7, 10),
.decay = c(0, 0.01, 0.1, 1))
set.seed(1122)
nnetTune <- train(trainingData$x, trainingData$y,
method = "nnet",
preProc = c("center", "scale"),
tuneGrid = nnetGrid,
trControl = trainControl(method = "cv"),
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1)
# which size is best
nnetTune$bestTune
## size decay
## 12 5 1
The best tune has a size of 5 (5 hidden units) and a decay of 1 rather than the 0.01 used above. Checking the fit:
nnPred <- predict(nnetTune, testData$x)
postResample(pred = nnPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.7217986 0.7036043 2.1329360
This model has a similar RMSE and R-squared (2.72 and 0.70) to the earlier model with decay 0.01 (2.79 and 0.70).
Running it again with the nnet package and the tuned decay:
set.seed(100)
nnetFit <- nnet(trainingData$x, trainingData$y, size = 5,
decay = 1,
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = 5 * (ncol(trainingData$x) + 1) + 5 + 1)
nnPred <- predict(nnetFit, testData$x)
postResample(pred = nnPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.2106642 0.8027398 1.6834260
With a decay of 1, the nnet fit improves to an RMSE of 2.21 and an R-squared of 0.80.
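The MaxNWts expression in the nnet calls above counts the parameters of a single-hidden-layer network with one output: each of the H hidden units has P input weights plus a bias, and the output unit has H weights plus a bias. A small sketch, with `n_weights` as a hypothetical helper name:

```r
# Hypothetical helper: number of weights for P predictors and H hidden units
n_weights <- function(p, h) h * (p + 1) + h + 1

n_weights(p = 10, h = 5)   # 61, the limit passed to the direct nnet() calls
n_weights(p = 10, h = 10)  # 121, enough for the largest size in the tuning grid
```

Setting the limit from the largest size in the grid keeps every candidate model within bounds.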
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
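MARS gives the best performance, with the highest test-set R-squared (about 0.93) and the lowest RMSE (1.28), ahead of SVM (2.05), the neural network (2.21) and KNN (3.20).
MARS selects only X1–X5 (though not in order), and none of the non-informative predictors appear. X3 is listed with a scaled importance of 0, the lowest of the five: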
varImp(marsTuned)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
#check the NA values
colSums(is.na(ChemicalManufacturingProcess))
## Yield BiologicalMaterial01 BiologicalMaterial02
## 0 0 0
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 0 0 0
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 0 0 0
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 0 0 0
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 0 1 3
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 15 1 1
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 2 1 1
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 0 9 10
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0 1
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 0 0 0
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 0 0 0
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 0 1 1
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 5 5
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 5 5 5
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 5 5 0
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 5 5 5
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 5 0 0
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 0 1 1
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 0 0 0
## ManufacturingProcess45
## 0
#VIM package for simple imputation
library(VIM)
set.seed(1122)
chem_imputed <- kNN(ChemicalManufacturingProcess, k = 5)
#check that there are no missing values
colSums(is.na(chem_imputed))
## Yield BiologicalMaterial01
## 0 0
## BiologicalMaterial02 BiologicalMaterial03
## 0 0
## BiologicalMaterial04 BiologicalMaterial05
## 0 0
## BiologicalMaterial06 BiologicalMaterial07
## 0 0
## BiologicalMaterial08 BiologicalMaterial09
## 0 0
## BiologicalMaterial10 BiologicalMaterial11
## 0 0
## BiologicalMaterial12 ManufacturingProcess01
## 0 0
## ManufacturingProcess02 ManufacturingProcess03
## 0 0
## ManufacturingProcess04 ManufacturingProcess05
## 0 0
## ManufacturingProcess06 ManufacturingProcess07
## 0 0
## ManufacturingProcess08 ManufacturingProcess09
## 0 0
## ManufacturingProcess10 ManufacturingProcess11
## 0 0
## ManufacturingProcess12 ManufacturingProcess13
## 0 0
## ManufacturingProcess14 ManufacturingProcess15
## 0 0
## ManufacturingProcess16 ManufacturingProcess17
## 0 0
## ManufacturingProcess18 ManufacturingProcess19
## 0 0
## ManufacturingProcess20 ManufacturingProcess21
## 0 0
## ManufacturingProcess22 ManufacturingProcess23
## 0 0
## ManufacturingProcess24 ManufacturingProcess25
## 0 0
## ManufacturingProcess26 ManufacturingProcess27
## 0 0
## ManufacturingProcess28 ManufacturingProcess29
## 0 0
## ManufacturingProcess30 ManufacturingProcess31
## 0 0
## ManufacturingProcess32 ManufacturingProcess33
## 0 0
## ManufacturingProcess34 ManufacturingProcess35
## 0 0
## ManufacturingProcess36 ManufacturingProcess37
## 0 0
## ManufacturingProcess38 ManufacturingProcess39
## 0 0
## ManufacturingProcess40 ManufacturingProcess41
## 0 0
## ManufacturingProcess42 ManufacturingProcess43
## 0 0
## ManufacturingProcess44 ManufacturingProcess45
## 0 0
## Yield_imp BiologicalMaterial01_imp
## 0 0
## BiologicalMaterial02_imp BiologicalMaterial03_imp
## 0 0
## BiologicalMaterial04_imp BiologicalMaterial05_imp
## 0 0
## BiologicalMaterial06_imp BiologicalMaterial07_imp
## 0 0
## BiologicalMaterial08_imp BiologicalMaterial09_imp
## 0 0
## BiologicalMaterial10_imp BiologicalMaterial11_imp
## 0 0
## BiologicalMaterial12_imp ManufacturingProcess01_imp
## 0 0
## ManufacturingProcess02_imp ManufacturingProcess03_imp
## 0 0
## ManufacturingProcess04_imp ManufacturingProcess05_imp
## 0 0
## ManufacturingProcess06_imp ManufacturingProcess07_imp
## 0 0
## ManufacturingProcess08_imp ManufacturingProcess09_imp
## 0 0
## ManufacturingProcess10_imp ManufacturingProcess11_imp
## 0 0
## ManufacturingProcess12_imp ManufacturingProcess13_imp
## 0 0
## ManufacturingProcess14_imp ManufacturingProcess15_imp
## 0 0
## ManufacturingProcess16_imp ManufacturingProcess17_imp
## 0 0
## ManufacturingProcess18_imp ManufacturingProcess19_imp
## 0 0
## ManufacturingProcess20_imp ManufacturingProcess21_imp
## 0 0
## ManufacturingProcess22_imp ManufacturingProcess23_imp
## 0 0
## ManufacturingProcess24_imp ManufacturingProcess25_imp
## 0 0
## ManufacturingProcess26_imp ManufacturingProcess27_imp
## 0 0
## ManufacturingProcess28_imp ManufacturingProcess29_imp
## 0 0
## ManufacturingProcess30_imp ManufacturingProcess31_imp
## 0 0
## ManufacturingProcess32_imp ManufacturingProcess33_imp
## 0 0
## ManufacturingProcess34_imp ManufacturingProcess35_imp
## 0 0
## ManufacturingProcess36_imp ManufacturingProcess37_imp
## 0 0
## ManufacturingProcess38_imp ManufacturingProcess39_imp
## 0 0
## ManufacturingProcess40_imp ManufacturingProcess41_imp
## 0 0
## ManufacturingProcess42_imp ManufacturingProcess43_imp
## 0 0
## ManufacturingProcess44_imp ManufacturingProcess45_imp
## 0 0
chem_final <- chem_imputed |>
dplyr::select(-ends_with("_imp"))
#leaving out the part where I checked for NZV because I didn't end up removing any
#variables
set.seed(1122)
train_index <- sample(nrow(chem_final), 0.8 * nrow(chem_final))
#train and test sets (yield + predictors are all in the same df)
train_y <- chem_final$Yield[train_index]
train_x <- chem_final[train_index, ] |> dplyr::select(-Yield)
test_y <- chem_final$Yield[-train_index]
test_x <- chem_final[-train_index, ] |> dplyr::select(-Yield)
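A quick sanity check of the split sizes can be sketched as follows. It assumes the data have 176 rows, which is consistent with the 140 training samples reported in the model output; the explicit floor() makes the rounding of the fractional size 0.8 * 176 = 140.8 visible rather than leaving it to sample().

```r
n <- 176                      # assumed number of rows in the imputed data
n_train <- floor(0.8 * n)     # 140 rows for training

set.seed(1122)
idx <- sample(n, n_train)     # row indices for the training set
n_test <- n - length(idx)     # remaining rows form the test set

c(train = length(idx), test = n_test)
```

With only 36 test rows, the test-set metrics are noisy, so small differences between models should be read with caution.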
# Define the candidate models to test
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
# Fix the seed so that the results can be reproduced
set.seed(100)
# Tune MARS over the grid with 10-fold cross-validation
marsTuned <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"))
marsTuned
## Multivariate Adaptive Regression Spline
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 126, 125, 127, 125, 127, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.445585 0.4286929 1.1697364
## 1 3 1.271529 0.5560206 1.0487655
## 1 4 1.221652 0.6127651 0.9909329
## 1 5 1.263234 0.5785473 1.0424573
## 1 6 1.276463 0.5764660 1.0476750
## 1 7 1.259923 0.5988646 1.0247931
## 1 8 1.256659 0.6008041 1.0349147
## 1 9 1.323914 0.5714970 1.0810989
## 1 10 1.329992 0.5720971 1.0872536
## 1 11 1.352325 0.5641226 1.1114240
## 1 12 1.376280 0.5457851 1.1369027
## 1 13 1.365723 0.5536315 1.1278057
## 1 14 1.363425 0.5531568 1.1218102
## 1 15 1.363425 0.5531568 1.1218102
## 1 16 1.363425 0.5531568 1.1218102
## 1 17 1.370906 0.5494370 1.1288829
## 1 18 1.375610 0.5477760 1.1327603
## 1 19 1.375610 0.5477760 1.1327603
## 1 20 1.375610 0.5477760 1.1327603
## 1 21 1.375610 0.5477760 1.1327603
## 1 22 1.375610 0.5477760 1.1327603
## 1 23 1.375610 0.5477760 1.1327603
## 1 24 1.375610 0.5477760 1.1327603
## 1 25 1.375610 0.5477760 1.1327603
## 1 26 1.375610 0.5477760 1.1327603
## 1 27 1.375610 0.5477760 1.1327603
## 1 28 1.375610 0.5477760 1.1327603
## 1 29 1.375610 0.5477760 1.1327603
## 1 30 1.375610 0.5477760 1.1327603
## 1 31 1.375610 0.5477760 1.1327603
## 1 32 1.375610 0.5477760 1.1327603
## 1 33 1.375610 0.5477760 1.1327603
## 1 34 1.375610 0.5477760 1.1327603
## 1 35 1.375610 0.5477760 1.1327603
## 1 36 1.375610 0.5477760 1.1327603
## 1 37 1.375610 0.5477760 1.1327603
## 1 38 1.375610 0.5477760 1.1327603
## 2 2 1.445585 0.4286929 1.1697364
## 2 3 1.364837 0.4910211 1.1169739
## 2 4 1.295405 0.5445814 1.0655822
## 2 5 1.325028 0.5492070 1.0762444
## 2 6 1.291145 0.5884713 1.0574915
## 2 7 1.315753 0.5960677 1.1061360
## 2 8 1.312895 0.5974567 1.0968864
## 2 9 1.284863 0.6086840 1.0661867
## 2 10 1.270438 0.6409868 1.0350719
## 2 11 1.232756 0.6412548 1.0116010
## 2 12 1.224191 0.6466803 1.0070967
## 2 13 1.242985 0.6465221 1.0258268
## 2 14 1.276888 0.6411067 1.0560455
## 2 15 1.360652 0.5848061 1.1078526
## 2 16 1.379017 0.5817154 1.1042628
## 2 17 1.357163 0.5946035 1.0915653
## 2 18 1.372913 0.5930793 1.1036115
## 2 19 1.430412 0.5617581 1.1629271
## 2 20 1.422250 0.5733586 1.1580907
## 2 21 1.424015 0.5815266 1.1529841
## 2 22 1.427942 0.5834527 1.1579311
## 2 23 1.412013 0.5870073 1.1441247
## 2 24 1.408759 0.5881885 1.1343740
## 2 25 1.408759 0.5881885 1.1343740
## 2 26 1.408759 0.5881885 1.1343740
## 2 27 1.408759 0.5881885 1.1343740
## 2 28 1.428444 0.5877352 1.1434987
## 2 29 1.428444 0.5877352 1.1434987
## 2 30 1.428444 0.5877352 1.1434987
## 2 31 1.428444 0.5877352 1.1434987
## 2 32 1.428444 0.5877352 1.1434987
## 2 33 1.428444 0.5877352 1.1434987
## 2 34 1.428444 0.5877352 1.1434987
## 2 35 1.428444 0.5877352 1.1434987
## 2 36 1.428444 0.5877352 1.1434987
## 2 37 1.428444 0.5877352 1.1434987
## 2 38 1.428444 0.5877352 1.1434987
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
head(predict(marsTuned, test_x))
## y
## [1,] 43.15080
## [2,] 40.94592
## [3,] 42.30278
## [4,] 39.75909
## [5,] 39.47725
## [6,] 41.49819
#checking the most important variables
varImp(marsTuned)
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.0
## ManufacturingProcess09 47.8
## ManufacturingProcess13 0.0
According to this output, only two variables have nonzero importance.
MarsPred <- predict(marsTuned, test_x)
postResample(pred = MarsPred, obs = test_y)
## RMSE Rsquared MAE
## 0.9180288 0.6608032 0.7386973
The test RMSE is small at 0.918 and the R-squared is about 0.66, so the model explains about 66% of the variability. This is better than the linear models I tried.
Using functions from the caret package:
#one predictor in the training set has near-zero variance, so remove it before fitting KNN:
knnDescr <- train_x[, -nearZeroVar(train_x)]
set.seed(100)
knnTune <- train(knnDescr,
train_y,
method = "knn",
preProc = c("center", "scale"),
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = "cv"))
knnTune
## k-Nearest Neighbors
##
## 140 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 126, 125, 127, 125, 127, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 1.572664 0.4355856 1.177938
## 2 1.375642 0.5000665 1.076028
## 3 1.320918 0.5551471 1.089611
## 4 1.302554 0.5505480 1.034190
## 5 1.341124 0.5312473 1.071822
## 6 1.336037 0.5427579 1.064899
## 7 1.347944 0.5279841 1.087428
## 8 1.344336 0.5279758 1.105406
## 9 1.353617 0.5331393 1.112105
## 10 1.354839 0.5324870 1.117459
## 11 1.382060 0.5017001 1.131944
## 12 1.370833 0.5053568 1.116786
## 13 1.390234 0.4893131 1.128413
## 14 1.385492 0.5027572 1.127644
## 15 1.399392 0.4899669 1.139616
## 16 1.408471 0.4807428 1.149303
## 17 1.429706 0.4632304 1.173909
## 18 1.437080 0.4598546 1.170098
## 19 1.450011 0.4490827 1.191298
## 20 1.443794 0.4618061 1.189157
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 4.
head(predict(knnTune, test_x))
## [1] 42.025 41.975 42.015 41.425 40.080 40.210
#most important variables are listed in part B
Test the fit:
knnPred <- predict(knnTune, test_x)
postResample(pred = knnPred, obs = test_y)
## RMSE Rsquared MAE
## 1.1632277 0.4874411 0.9598611
The test RMSE is about 1.16, higher than the 0.918 from MARS, and the R-squared is only about 0.49, so KNN explains roughly half of the variance.
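One caution about the `train_x[, -nearZeroVar(train_x)]` idiom used for KNN above: it works here because one column is flagged, but a negative empty index selects no columns at all. The sketch below shows the pitfall and a guard in base R; `drop_flagged` and the toy data frame are hypothetical, and `flagged` stands in for the result of nearZeroVar().

```r
toy <- data.frame(a = 1:5, b = c(2, 4, 1, 3, 5), c = 5:1)

# Stand-in for a nearZeroVar() result when no column is flagged
flagged <- integer(0)
ncol(toy[, -flagged, drop = FALSE])   # 0 columns: everything is dropped

# Hypothetical guard: only subset when something was actually flagged
drop_flagged <- function(df, flagged) {
  if (length(flagged) > 0) df[, -flagged, drop = FALSE] else df
}

ncol(drop_flagged(toy, flagged))  # 3 columns: nothing flagged, nothing removed
ncol(drop_flagged(toy, 2L))       # 2 columns: column 'b' removed
```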
set.seed(1122)
svmRTuned <- train(train_x, train_y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
svmRTuned
## Support Vector Machines with Radial Basis Function Kernel
##
## 140 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 127, 125, 126, 126, 125, 126, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.452498 0.5109138 1.1701907
## 0.50 1.321489 0.5672822 1.0590686
## 1.00 1.238206 0.6088460 0.9771733
## 2.00 1.192591 0.6268670 0.9248975
## 4.00 1.177580 0.6330657 0.9228731
## 8.00 1.175891 0.6316652 0.9237709
## 16.00 1.175405 0.6318894 0.9231476
## 32.00 1.175405 0.6318894 0.9231476
## 64.00 1.175405 0.6318894 0.9231476
## 128.00 1.175405 0.6318894 0.9231476
## 256.00 1.175405 0.6318894 0.9231476
## 512.00 1.175405 0.6318894 0.9231476
## 1024.00 1.175405 0.6318894 0.9231476
## 2048.00 1.175405 0.6318894 0.9231476
##
## Tuning parameter 'sigma' was held constant at a value of 0.01530119
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01530119 and C = 16.
svmRTuned$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: eps-svr (regression)
## parameter : epsilon = 0.1 cost C = 16
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0153011896520962
##
## Number of Support Vectors : 122
##
## Objective Function Value : -70.0997
## Training error : 0.009114
head(predict(svmRTuned, test_x))
## [1] 42.41294 42.66431 42.09665 40.51665 40.33008 40.34078
#most important variables are listed in part B
#varimp doesn't work on this model
Check the fit:
svmPred <- predict(svmRTuned, test_x)
postResample(pred = svmPred, obs = test_y)
## RMSE Rsquared MAE
## 0.9669417 0.6307160 0.7920620
This is similar to MARS, with an R-squared of about 0.63 and an RMSE of 0.967.
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
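On the test set, MARS performed best with an RMSE of 0.918 and an R-squared of about 0.66, narrowly ahead of SVM (0.967 and 0.63) and well ahead of KNN (1.163 and 0.49). In resampling, however, SVM had the lowest cross-validated RMSE (1.175, versus 1.222 for MARS and 1.303 for KNN). Even the best test-set result is still on the low end (it only explains 66% of the variance), so I wouldn’t recommend it.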
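The resampling and test-set figures reported in the outputs above can be collected into one table for comparison. This is only a convenience sketch; the numbers are copied from the caret and postResample output for the selected tuning values, and `results` is a hypothetical object name.

```r
results <- data.frame(
  model     = c("MARS", "KNN", "SVM"),
  cv_rmse   = c(1.221652, 1.302554, 1.175405),     # best resampled RMSE per model
  test_rmse = c(0.9180288, 1.1632277, 0.9669417),  # postResample on the test set
  test_r2   = c(0.6608032, 0.4874411, 0.6307160)
)

results$model[which.min(results$cv_rmse)]    # lowest cross-validated RMSE
results$model[which.min(results$test_rmse)]  # lowest test-set RMSE
```

The two criteria pick different models, and the gap between MARS and SVM is small on both, so the ranking between them is not clear-cut.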
Reproducing the top variables for each model here.
Taking MARS as the optimal model based on test-set performance, only two variables have nonzero importance, ManufacturingProcess32 and ManufacturingProcess09. In this case, manufacturing processes dominate the list.
#MARS (only shows three)
varImp(marsTuned)
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.0
## ManufacturingProcess09 47.8
## ManufacturingProcess13 0.0
Looking at the other models:
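I used varImp with KNN and SVM, but realized something was wrong when the lists were exactly the same. For these models varImp does not seem to use the fitted model; it appears to rank predictors by their individual relationship with the outcome, which would explain the identical lists. The code below instead uses the vip package to compute permutation importance based on RMSE.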
https://rdrr.io/cran/vip/src/R/vi_permute.R https://rpubs.com/erblast/mars
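The idea behind permutation importance can be sketched in base R: shuffle one predictor at a time and record how much the RMSE gets worse. This is an illustration on toy data with a linear model, not vip's implementation; `permutation_importance` and the data are hypothetical.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.1)   # only x1 drives the outcome
fit <- lm(y ~ x1 + x2, data = toy)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Hypothetical helper: increase in RMSE after shuffling each feature once
permutation_importance <- function(fit, data, target, features) {
  baseline <- rmse(predict(fit, data), data[[target]])
  sapply(features, function(f) {
    shuffled <- data
    shuffled[[f]] <- sample(shuffled[[f]])   # break the link to the outcome
    rmse(predict(fit, shuffled), data[[target]]) - baseline
  })
}

imp <- permutation_importance(fit, toy, "y", c("x1", "x2"))
imp
```

Because the shuffling is random, repeated runs can reorder predictors with similar importance, which is worth keeping in mind when comparing the rankings that follow.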
library(vip)
#SVM
vi_svm <- vi(svmRTuned,
method = "permute",
target = train_y,
metric = "RMSE",
pred_wrapper = predict)
vi_svm
## # A tibble: 57 × 2
## Variable Importance
## <chr> <dbl>
## 1 ManufacturingProcess32 0.526
## 2 ManufacturingProcess36 0.280
## 3 ManufacturingProcess13 0.246
## 4 ManufacturingProcess37 0.222
## 5 ManufacturingProcess24 0.200
## 6 ManufacturingProcess17 0.199
## 7 ManufacturingProcess28 0.195
## 8 ManufacturingProcess11 0.191
## 9 BiologicalMaterial03 0.188
## 10 ManufacturingProcess23 0.173
## # ℹ 47 more rows
#KNN
vi_knn <- vi(knnTune,
method = "permute",
target = train_y,
metric = "RMSE",
pred_wrapper = predict)
vi_knn
## # A tibble: 56 × 2
## Variable Importance
## <chr> <dbl>
## 1 ManufacturingProcess12 0.0910
## 2 ManufacturingProcess36 0.0821
## 3 ManufacturingProcess32 0.0656
## 4 ManufacturingProcess13 0.0626
## 5 BiologicalMaterial03 0.0583
## 6 BiologicalMaterial01 0.0504
## 7 BiologicalMaterial06 0.0465
## 8 ManufacturingProcess25 0.0409
## 9 BiologicalMaterial09 0.0408
## 10 ManufacturingProcess23 0.0399
## # ℹ 46 more rows
MARS and SVM both rank ManufacturingProcess32 as the top variable. MARS lists only three variables: ManufacturingProcess32 first (100), ManufacturingProcess09 at 47.8, and ManufacturingProcess13 at 0. KNN puts ManufacturingProcess12 at the top, followed by ManufacturingProcess36 and ManufacturingProcess32. KNN's permutation importances (at most 0.09) are much lower than SVM's (up to 0.53); the MARS values are scaled to 0-100 and are not directly comparable.
Manufacturing variables dominate for all three models. Note that there are more manufacturing than biological variables in general.
How they compare to the linear model: Last week, I checked the top variables for elastic net. BiologicalMaterial06 is at the top there; among the nonlinear results shown above it appears only in the KNN list, at 7th. The top variable for MARS and SVM (process 32) ranks 7th for elastic net, and process 36 (second for both SVM and KNN) ranks 9th. Overall, they’re different lists, still dominated by manufacturing variables.
Sources for some plotting techniques/the plotmo package:
https://bradleyboehmke.github.io/HOML/mars.html#the-basic-idea https://rpubs.com/erblast/mars
Visualizing all three models and their predictor variables first:
#MARS
top_vars <- evimp(marsTuned$finalModel)
top_10_names <- head(rownames(top_vars), 10)  # at most 10 names; avoids NA when fewer are selected
plotmo(marsTuned, which = 1, predict.terms = top_10_names)
## plotmo grid: BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 6.355 55.09 67.38
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 12.07 18.42 48.46
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 100 17.49 12.83
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 2.63 146.02 20.06
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 11.4 21 1.55
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 934 998.75 206.6
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 177 178 45.805
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 9.1 9.4 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 34.5 4856.5 6033.5
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 4588 34.4 4842
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 6024 4582 -0.3
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 5 3 8
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 4856 6047 4585.5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 10.4 19.9 9.15
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 70.8 158 64
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 2.5 495.5 0.02
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 3 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 0 0 11.6
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 0.8 1.9 2.2
plotmo(marsTuned,
which = 1,
subset = top_10_names,
caption = "MARS",
pt.col = "gray")
#SVM
plotmo(svmRTuned,
which = 1,
degree1 = c("ManufacturingProcess32", "ManufacturingProcess36",
"ManufacturingProcess37", "ManufacturingProcess17",
"ManufacturingProcess11", "ManufacturingProcess13",
"ManufacturingProcess28", "ManufacturingProcess34",
"BiologicalMaterial05", "ManufacturingProcess24"),
caption = "SVM",
pt.col = "gray")
#KNN (looks like it may have overfit)
plotmo(knnTune,
which = 1,
degree1 = c("ManufacturingProcess36",
"ManufacturingProcess13",
"ManufacturingProcess17",
"ManufacturingProcess12",
"BiologicalMaterial02",
"ManufacturingProcess22",
"BiologicalMaterial01",
"ManufacturingProcess07",
"ManufacturingProcess09",
"ManufacturingProcess32"),
caption = "KNN",
pt.col = "gray")
## plotmo grid: BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 6.355 55.09 67.38
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 12.07 18.42 48.46
## BiologicalMaterial08 BiologicalMaterial09 BiologicalMaterial10
## 17.49 12.83 2.63
## BiologicalMaterial11 BiologicalMaterial12 ManufacturingProcess01
## 146.02 20.06 11.4
## ManufacturingProcess02 ManufacturingProcess03 ManufacturingProcess04
## 21 1.55 934
## ManufacturingProcess05 ManufacturingProcess06 ManufacturingProcess07
## 998.75 206.6 177
## ManufacturingProcess08 ManufacturingProcess09 ManufacturingProcess10
## 178 45.805 9.1
## ManufacturingProcess11 ManufacturingProcess12 ManufacturingProcess13
## 9.4 0 34.5
## ManufacturingProcess14 ManufacturingProcess15 ManufacturingProcess16
## 4856.5 6033.5 4588
## ManufacturingProcess17 ManufacturingProcess18 ManufacturingProcess19
## 34.4 4842 6024
## ManufacturingProcess20 ManufacturingProcess21 ManufacturingProcess22
## 4582 -0.3 5
## ManufacturingProcess23 ManufacturingProcess24 ManufacturingProcess25
## 3 8 4856
## ManufacturingProcess26 ManufacturingProcess27 ManufacturingProcess28
## 6047 4585.5 10.4
## ManufacturingProcess29 ManufacturingProcess30 ManufacturingProcess31
## 19.9 9.15 70.8
## ManufacturingProcess32 ManufacturingProcess33 ManufacturingProcess34
## 158 64 2.5
## ManufacturingProcess35 ManufacturingProcess36 ManufacturingProcess37
## 495.5 0.02 1
## ManufacturingProcess38 ManufacturingProcess39 ManufacturingProcess40
## 3 7.2 0
## ManufacturingProcess41 ManufacturingProcess42 ManufacturingProcess43
## 0 11.6 0.8
## ManufacturingProcess44 ManufacturingProcess45
## 1.9 2.2
Since the MARS model uses only 3 predictors (one of which it has assigned a zero value), I’m pulling the unique predictors for every model, not only for the model with the best R-squared.
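The unique predictors plotted in the sections that follow can be recovered with setdiff(). This is a minimal sketch, assuming the two ten-name vectors are the ones passed to degree1 in the SVM and KNN calls above. The names svm_top, knn_top, svm_only, and knn_only are illustrative, and the MARS terms are left out because they are not all listed here.

```r
# Top-10 predictor names as passed to degree1 in the SVM and KNN plots
svm_top <- c("ManufacturingProcess32", "ManufacturingProcess36",
             "ManufacturingProcess37", "ManufacturingProcess17",
             "ManufacturingProcess11", "ManufacturingProcess13",
             "ManufacturingProcess28", "ManufacturingProcess34",
             "BiologicalMaterial05", "ManufacturingProcess24")
knn_top <- c("ManufacturingProcess36", "ManufacturingProcess13",
             "ManufacturingProcess17", "ManufacturingProcess12",
             "BiologicalMaterial02", "ManufacturingProcess22",
             "BiologicalMaterial01", "ManufacturingProcess07",
             "ManufacturingProcess09", "ManufacturingProcess32")

# setdiff() keeps the names in the first vector that are absent from the second
svm_only <- setdiff(svm_top, knn_top)
knn_only <- setdiff(knn_top, svm_top)

svm_only
knn_only
```

Each result has six names, matching the degree1 vectors used in the SVM and KNN plots of unique predictors.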
MARS
col_index <- which(colnames(train_x) == "ManufacturingProcess09")
plotmo(marsTuned$finalModel,
which = 1,
caption = "MARS",
pt.col = "gray",
degree1 = "ManufacturingProcess09")
## plotmo grid: BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## -0.08877564 -0.1522094 -0.09892082
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## -0.1401945 -0.07891408 -0.1332374
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## -0.1474496 0.02130545 -0.02657837
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## -0.2418646 -0.1860545 -0.1778
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 0.1434582 0.5093331 0.4523672
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 0.383586 -0.08741698 -0.2843345
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## -0.9822878 0.8755778 0.08980004
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## -0.1119662 0.03204418 -0.4982111
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## -0.01004455 0.04847142 -0.0871858
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## -0.03366167 0.02889183 0.09163441
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## -0.1084604 0.08252239 -0.1969995
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## -0.112253 -0.01294419 -0.2008734
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## -0.02795428 -0.07705969 -0.069302
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 0.7601398 -0.351222 -0.06190869
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 0.1423564 -0.08135609 0.1887734
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 0.07639713 -0.05353648 0.4666874
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## -0.003159363 0.7421258 0.2112721
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## -0.4532314 -0.4343483 0.1877572
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## -0.1172866 0.2908739 0.1331032
Manufacturing process 09 technically isn’t unique, since it also appears in the KNN list. In the MARS fit, yield increases with this predictor and then levels off past a certain point.
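MARS builds its fit from hinge functions, and a single hinge term is enough to produce a rise followed by a plateau. This is a minimal sketch of that shape. The intercept, slope, and knot are made-up values for illustration, not coefficients from marsTuned, and the x grid assumes a centered and scaled predictor.

```r
# Mirrored hinge: positive below the knot, zero at and above it
hinge <- function(x, knot) pmax(0, knot - x)

# Hypothetical values, NOT taken from the fitted model
intercept <- 40
slope <- 2
knot <- 0.5

x <- seq(-2, 2, by = 0.5)
yhat <- intercept - slope * hinge(x, knot)

# Predictions rise with x until the knot, then stay flat at the intercept
data.frame(x = x, yhat = yhat)
```

The coefficients of the actual fit can be read from summary(marsTuned$finalModel).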
SVM
plotmo(svmRTuned,
which = 1,
degree1 = c("ManufacturingProcess37",
"ManufacturingProcess11",
"ManufacturingProcess28",
"ManufacturingProcess34",
"BiologicalMaterial05",
"ManufacturingProcess24"),
caption = "SVM",
pt.col = "gray")
SVM has six unique predictors, all of which have been assigned roughly linear relationships. These relationships look fairly weak. The variance looks high and, in some cases, slightly higher values in certain sections of the plot may be due to random chance (noise), yet SVM has assigned importance to them. It makes sense that the relationships appear weaker, since we are looking only at the model’s unique predictors (i.e., predictors no other model chose).
#leaving 09 in here as well to see how a different model treats it
plotmo(knnTune,
which = 1,
degree1 = c(
"ManufacturingProcess12",
"BiologicalMaterial02",
"ManufacturingProcess22",
"BiologicalMaterial01",
"ManufacturingProcess07",
"ManufacturingProcess09"),
caption = "KNN",
pt.col = "gray")
Again, it looks like KNN overfit on these variables. Process 07, for example, shows two similar clusters on the left and right sides of the graph, and the model has seemingly assigned an arbitrary dip in the middle. Process 12 has a similar problem. Bio 01, process 22, and process 01 all show the model overfitting to small changes in the data, dipping and rising in different parts of the graph.
This is a common problem for KNN, particularly when k = 1. However, this model selected k = 4. Overfitting can also happen with data like this, where many of the variables don’t seem to have a strong relationship with yield.
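The effect of k on smoothness can be seen on pure noise. This is a minimal sketch in base R, assuming a single predictor with no true relationship to the response. knn_predict is a hypothetical helper written for this illustration, not part of caret.

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- rnorm(n)  # pure noise: y does not depend on x

# Hypothetical helper: one-dimensional k-nearest-neighbour regression
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# With k = 1 each training point is its own nearest neighbour,
# so the fit reproduces the training responses exactly
train_k1 <- knn_predict(x, y, x, k = 1)
max(abs(train_k1 - y))

# Fitted curves over a grid for increasing k
grid <- seq(0, 1, length.out = 200)
fit_k1  <- knn_predict(x, y, grid, k = 1)
fit_k4  <- knn_predict(x, y, grid, k = 4)
fit_k25 <- knn_predict(x, y, grid, k = 25)

# Spread of each fitted curve: a larger value means the curve swings more
sapply(list(k1 = fit_k1, k4 = fit_k4, k25 = fit_k25), sd)
```

A small k still follows the noise closely, which is consistent with the dips and rises in the KNN plots above.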
Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Based on the unique predictors, the relationships that SVM and KNN picked are not strong, and KNN has overfit, showing relationships that may not be valid.
Looking at all the manufacturing predictors (from the first set of charts), some variables, like process 32 and process 36, seem to have a stronger nonlinear correlation with yield. For the MARS model, I would like more data with process 13 values above 37, which could help validate the relationship or show a downward slope as x increases.