Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to generate data:
\(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \text{noise}\)
where the ten predictors are independent and uniformly distributed on [0, 1]; only x1–x5 enter the equation, and x6–x10 are non-informative. Which models appear to give the best performance? Does MARS select the informative predictors (those named x1–x5)?
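For reference, the data-generating process can be written out directly in a few lines. This is only a sketch of the simulation (the actual data below come from mlbench.friedman1), assuming ten independent Uniform(0, 1) predictors and Gaussian noise:

simulate_friedman1 <- function(n, sd = 1) {
  # Ten independent U(0, 1) predictors; only x1-x5 enter the response
  x <- matrix(runif(n * 10), ncol = 10)
  y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = sd)
  list(x = x, y = y)
}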
library(mlbench)  # mlbench.friedman1()
library(caret)    # featurePlot(), train(), varImp()
library(earth)    # MARS
library(kernlab)  # radial-basis SVM
library(dplyr)    # data manipulation
set.seed(200)
trainingData <- mlbench.friedman1(200, sd=1)
# Convert the 'x' data from a matrix to a data frame
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
# mlbench.friedman1() returns a list with a vector 'y' and a matrix of
# predictors 'x'. Also simulate a large test set to estimate the true
# error rate with good precision:
testData <- mlbench.friedman1(5000, sd=1)
testData$x <- data.frame(testData$x)
featurePlot(testData$x, testData$y)
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Model Performance (k-Nearest Neighbors)
From the KNN model tuning results:
- Best resampled RMSE (lowest): 3.183, at k = 17
- Resampled R² at k = 17: 0.643
- Final model selected: k = 17
- Test-set performance: RMSE = 3.204, R² = 0.682, MAE = 2.568
These results suggest that KNN with k = 17 gives the best performance among the KNN models tested, showing reasonable predictive accuracy on the nonlinear Friedman1 simulation.
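As a quick visual check of the tuning results, caret's plot method for train objects draws the resampling profile (plot not shown here):

plot(knnModel)  # resampled RMSE versus the number of neighbors k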
library(nnet)  # avNNet() lives in caret but fits nnet() networks
set.seed(421)
layers <- 5
# Note: linout = TRUE is not set here, so nnet() uses a logistic output
# unit; for a continuous response on this scale that badly limits the fit,
# which shows up in the large test RMSE below.
avnnet_fit <- avNNet(
trainingData$x,
trainingData$y,
decay = 0.01,
size = layers,
repeats = 5,
maxit = 500,
bag = TRUE
)
## Warning: executing %dopar% sequentially: no parallel backend registered
## Fitting Repeat 1
##
## # weights: 61
## initial value 45533.812204
## iter 10 value 42651.755288
## iter 20 value 42550.319453
## iter 30 value 42548.534053
## iter 40 value 42548.422007
## iter 40 value 42548.421641
## iter 40 value 42548.421506
## final value 42548.421506
## converged
## Fitting Repeat 2
##
## # weights: 61
## initial value 43344.694521
## iter 10 value 40188.948530
## iter 20 value 40108.975272
## iter 30 value 40107.586044
## iter 40 value 40107.495096
## iter 40 value 40107.494937
## iter 40 value 40107.494884
## final value 40107.494884
## converged
## Fitting Repeat 3
##
## # weights: 61
## initial value 45575.552973
## iter 10 value 42240.911256
## iter 20 value 42178.065608
## iter 30 value 42176.176629
## iter 40 value 42176.052626
## final value 42176.050418
## converged
## Fitting Repeat 4
##
## # weights: 61
## initial value 45002.076591
## iter 10 value 41253.095282
## iter 20 value 41251.286533
## final value 41251.245119
## converged
## Fitting Repeat 5
##
## # weights: 61
## initial value 42723.104381
## iter 10 value 39743.427266
## iter 20 value 39731.736479
## iter 30 value 39731.636476
## iter 30 value 39731.636253
## iter 30 value 39731.636235
## final value 39731.636235
## converged
nnet_pred <- predict(
avnnet_fit,
newdata = testData$x
)
postResample(
pred = nnet_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 14.2769353 0.2814884 13.3869178
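The poor fit above is consistent with the logistic output unit noted in the code comment. As a hedged sketch (not the original run; results would differ), the same model refit with a linear output unit:

set.seed(421)
# Refit with linout = TRUE, appropriate for a continuous response
avnnet_fit2 <- avNNet(
  trainingData$x,
  trainingData$y,
  decay = 0.01,
  size = layers,
  repeats = 5,
  linout = TRUE,
  maxit = 500,
  bag = TRUE
)
postResample(pred = predict(avnnet_fit2, newdata = testData$x), obs = testData$y)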
Multivariate Adaptive Regression Splines (MARS)
MARS correctly identifies and models the nonlinearities in the data (e.g., the sine and squared terms). The plotmo plots below also help confirm that the important predictors are X1–X5, as intended in the Friedman1 design; X6 enters the initial fit only with a small coefficient even though it is non-informative by construction. These plots are a useful diagnostic for feature importance and interpretability in nonlinear models like MARS.
library(earth)
mars_fit <- earth(
trainingData$x,
trainingData$y
)
summary(mars_fit)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
plotmo(mars_fit)
## plotmo grid: X1 X2 X3 X4 X5 X6 X7
## 0.5139349 0.5106664 0.537307 0.4445841 0.5343299 0.4975981 0.4688035
## X8 X9 X10
## 0.497961 0.5288716 0.5359218
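earth also reports its own backward-pass variable importance, which gives another view of the predictor selection (output not shown):

evimp(mars_fit)  # ranks predictors by the nsubsets, GCV, and RSS criteria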
mars_pred <- predict(
mars_fit,
newdata = testData$x
)
postResample(
pred = mars_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 1.8136467 0.8677298 1.3911836
mars_iter_fit <- train(
trainingData$x,
trainingData$y,
method = "earth",
tuneGrid = expand.grid(.degree = 1:4, .nprune = 2:50),
trControl = trainControl(method = "cv")
)
mars_iter_fit$finalModel
## Selected 15 of 18 terms, and 5 of 10 predictors (nprune=15)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6-unused, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 10 4
## GCV 1.618197 RSS 217.6151 GRSq 0.9343005 RSq 0.9553786
plot(varImp(mars_iter_fit))
mars_iter_pred <- predict(
mars_iter_fit,
newdata = testData$x
)
postResample(
pred = mars_iter_pred,
obs = testData$y
)
## RMSE Rsquared MAE
## 1.1589948 0.9460418 0.9250230
MARS performs exceptionally well, achieving the highest test R² (0.946) and the lowest test RMSE (1.159) among the models fit here. Importantly, the tuned model automatically identified the informative predictors (X1–X5, with X6–X10 unused), which aligns with the true structure of the Friedman1 simulated data.
MARS is the best-performing model and shows strong variable selection capabilities.
Neural networks may benefit from dimension reduction such as PCA when predictors are highly correlated (as in the Tecator data); for Friedman1, whose predictors are independent by construction, the weak avNNet result is better explained by the missing linout = TRUE setting noted above.
KNN offers a solid balance of simplicity and predictive power but may lag behind more flexible models like MARS.
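To make the comparison for this exercise explicit, the test-set metrics computed above can be collected into one table; a small sketch using the prediction objects already in scope (output not shown):

rbind(
  KNN     = postResample(pred = knnPred, obs = testData$y),
  avNNet  = postResample(pred = nnet_pred, obs = testData$y),
  MARS    = postResample(pred = mars_pred, obs = testData$y),
  MARS_cv = postResample(pred = mars_iter_pred, obs = testData$y)
)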
Exercise 7.5
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
chem <- ChemicalManufacturingProcess
# Make this reproducible
set.seed(42)
# Impute missing values via k-nearest neighbors (knnImpute also centers
# and scales the predictors)
imputer <- preProcess(chem, "knnImpute")
df <- predict(imputer, ChemicalManufacturingProcess)
# Drop near-zero-variance predictors
df <- df %>%
select_at(vars(-one_of(nearZeroVar(., names = TRUE))))
in_train <- createDataPartition(df$Yield, times = 1, p = 0.8, list = FALSE)
train_df <- df[in_train, ]
test_df <- df[-in_train, ]
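A quick sanity check (a sketch) that the imputation removed all missing values and that the 80/20 split behaved as expected:

anyNA(df)      # expect FALSE after knnImpute
dim(train_df)  # expect 144 rows (80% of the 176 samples)
dim(test_df)   # expect 32 rows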
pls_model <- train(
Yield ~ ., data = train_df, method = "pls",
# the predictors were already centered and scaled by knnImpute above
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
pls_model
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 128, 129, 130, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 0.8824790 0.3779221 0.6711462
## 2 1.1458456 0.4219806 0.7086431
## 3 0.7363066 0.5244517 0.5688553
## 4 0.8235294 0.5298005 0.5933120
## 5 0.9670735 0.4846010 0.6371199
## 6 0.9959036 0.4776684 0.6427478
## 7 0.9119517 0.4986338 0.6200233
## 8 0.9068621 0.5012144 0.6293371
## 9 0.8517370 0.5220166 0.6163795
## 10 0.8919356 0.5062912 0.6332243
## 11 0.9173758 0.4934557 0.6463164
## 12 0.9064125 0.4791526 0.6485663
## 13 0.9255289 0.4542181 0.6620193
## 14 1.0239913 0.4358371 0.6944056
## 15 1.0754710 0.4365214 0.7077991
## 16 1.1110579 0.4269065 0.7135684
## 17 1.1492855 0.4210485 0.7222868
## 18 1.1940639 0.4132534 0.7396357
## 19 1.2271867 0.4079005 0.7494818
## 20 1.2077102 0.4022859 0.7470327
## 21 1.2082648 0.4026711 0.7452969
## 22 1.2669285 0.3987044 0.7634170
## 23 1.3663033 0.3970188 0.7957514
## 24 1.4531634 0.3898475 0.8243034
## 25 1.5624265 0.3820102 0.8612935
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.
pls_predictions <- predict(pls_model, test_df)
results <- data.frame(t(postResample(pred = pls_predictions, obs = test_df$Yield))) %>%
mutate("Model"= "PLS")
results
## RMSE Rsquared MAE Model
## 1 0.6192577 0.6771122 0.5059984 PLS
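The component-wise resampling profile can be visualized with caret's plot method (plot not shown):

plot(pls_model)  # resampled RMSE versus the number of PLS components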
KNN Model
knn_model <- train(
Yield ~ ., data = train_df, method = "knn",
center = TRUE,
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 25
)
knn_model
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 130, 130, 130, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.7085987 0.4846711 0.5709345
## 7 0.7331379 0.4559107 0.5981516
## 9 0.7417741 0.4556845 0.6124459
## 11 0.7505969 0.4526780 0.6277377
## 13 0.7450412 0.4691830 0.6181457
## 15 0.7539451 0.4604423 0.6246762
## 17 0.7663763 0.4474991 0.6306860
## 19 0.7694326 0.4480450 0.6315047
## 21 0.7706841 0.4508441 0.6288689
## 23 0.7780866 0.4419184 0.6350889
## 25 0.7834333 0.4320541 0.6382165
## 27 0.7877890 0.4339581 0.6454522
## 29 0.7981302 0.4187796 0.6533767
## 31 0.8060716 0.3977540 0.6610186
## 33 0.8063566 0.4128701 0.6600828
## 35 0.8144040 0.3964037 0.6671734
## 37 0.8160580 0.4009313 0.6662999
## 39 0.8220176 0.3884786 0.6693883
## 41 0.8274778 0.3750873 0.6746765
## 43 0.8288000 0.3820179 0.6758220
## 45 0.8285016 0.3878549 0.6747216
## 47 0.8331504 0.3808342 0.6774480
## 49 0.8368096 0.3816669 0.6829726
## 51 0.8391074 0.3804316 0.6857959
## 53 0.8442763 0.3704957 0.6915276
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
knn_predictions <- predict(knn_model, test_df)
results <- data.frame(t(postResample(pred = knn_predictions, obs = test_df$Yield))) %>%
mutate("Model"= "KNN") %>% rbind(results)
print(results)
## RMSE Rsquared MAE Model
## 1 0.7149819 0.5894607 0.5047973 KNN
## 2 0.6192577 0.6771122 0.5059984 PLS
MARS Model
MARS_grid <- expand.grid(.degree = 1:2, .nprune = 2:15)
MARS_model <- train(
Yield ~ ., data = train_df, method = "earth",
tuneGrid = MARS_grid,
# Passing center/scale here throws an error: earth() does not accept those
# arguments. caret's preProc argument would be the supported route, and the
# data are already centered and scaled by knnImpute above.
trControl = trainControl("cv", number = 10)
)
MARS_model
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 128, 130, 128, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.7744594 0.4159942 0.6084363
## 1 3 0.7003597 0.5682276 0.5530374
## 1 4 0.6469215 0.6086750 0.5154043
## 1 5 0.6404520 0.6073071 0.5042864
## 1 6 0.6960964 0.5396157 0.5588565
## 1 7 0.6997921 0.5521698 0.5669813
## 1 8 0.7147341 0.5423472 0.5621022
## 1 9 0.7173559 0.5508650 0.5641610
## 1 10 0.7487512 0.5325936 0.5869088
## 1 11 0.7415722 0.5353676 0.5847493
## 1 12 0.7353149 0.5519142 0.5768069
## 1 13 0.7450704 0.5357967 0.5887685
## 1 14 0.7419274 0.5400866 0.5844385
## 1 15 0.7262459 0.5553445 0.5775694
## 2 2 0.7814439 0.4052275 0.6150579
## 2 3 0.6961165 0.5526862 0.5636101
## 2 4 0.6983163 0.5544842 0.5575337
## 2 5 0.6730604 0.5897508 0.5262153
## 2 6 0.6707267 0.5848137 0.5147009
## 2 7 0.6592644 0.5959026 0.5199324
## 2 8 0.6170179 0.6400977 0.4808173
## 2 9 0.5848818 0.6718214 0.4591934
## 2 10 0.5844613 0.6757433 0.4556203
## 2 11 0.5868888 0.6684397 0.4622055
## 2 12 0.5804160 0.6762717 0.4550110
## 2 13 0.5822871 0.6910726 0.4475039
## 2 14 0.5914308 0.6855774 0.4499391
## 2 15 0.5796279 0.6970556 0.4450954
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
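To see which hinge functions and interaction terms the selected model (degree = 2, nprune = 15) actually uses, the final earth object can be inspected (output not shown):

summary(MARS_model$finalModel)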
MARS_predictions <- predict(MARS_model, test_df)
results <- data.frame(t(postResample(pred = MARS_predictions, obs = test_df$Yield))) %>%
mutate("Model"= "MARS") %>% rbind(results)
print(results)
## RMSE Rsquared MAE Model
## 1 0.7694337 0.5378346 0.5809592 MARS
## 2 0.7149819 0.5894607 0.5047973 KNN
## 3 0.6192577 0.6771122 0.5059984 PLS
SVM Model
SVM_model <- train(
Yield ~ ., data = train_df, method = "svmRadial",
center = TRUE,
scale = TRUE,
trControl = trainControl(method = "cv"),
tuneLength = 25
)
SVM_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7683015 0.4646282 0.6392703
## 0.50 0.7174781 0.4848381 0.5965095
## 1.00 0.6686679 0.5330949 0.5530530
## 2.00 0.6409061 0.5552794 0.5275947
## 4.00 0.6356640 0.5619233 0.5166512
## 8.00 0.6286151 0.5761807 0.5115837
## 16.00 0.6287131 0.5760690 0.5116965
## 32.00 0.6287131 0.5760690 0.5116965
## 64.00 0.6287131 0.5760690 0.5116965
## 128.00 0.6287131 0.5760690 0.5116965
## 256.00 0.6287131 0.5760690 0.5116965
## 512.00 0.6287131 0.5760690 0.5116965
## 1024.00 0.6287131 0.5760690 0.5116965
## 2048.00 0.6287131 0.5760690 0.5116965
## 4096.00 0.6287131 0.5760690 0.5116965
## 8192.00 0.6287131 0.5760690 0.5116965
## 16384.00 0.6287131 0.5760690 0.5116965
## 32768.00 0.6287131 0.5760690 0.5116965
## 65536.00 0.6287131 0.5760690 0.5116965
## 131072.00 0.6287131 0.5760690 0.5116965
## 262144.00 0.6287131 0.5760690 0.5116965
## 524288.00 0.6287131 0.5760690 0.5116965
## 1048576.00 0.6287131 0.5760690 0.5116965
## 2097152.00 0.6287131 0.5760690 0.5116965
## 4194304.00 0.6287131 0.5760690 0.5116965
##
## Tuning parameter 'sigma' was held constant at a value of 0.015148
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.015148 and C = 8.
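Note that the resampled RMSE is flat for C >= 16: once the cost penalty is large enough, essentially the same solution is recovered, so the profile plateaus. Plotting the profile on a log cost axis makes this easy to see (plot not shown):

plot(SVM_model, scales = list(x = list(log = 2)))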
SVM_predictions <- predict(SVM_model, test_df)
results <- data.frame(t(postResample(pred = SVM_predictions, obs = test_df$Yield))) %>%
mutate("Model"= "SVM") %>% rbind(results)
Neural Network Model
nnet_grid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
# Cap on the number of weights; with 56 predictors this only accommodates
# networks up to size = 5 (see the check after the results below)
nnet_maxnwts <- 5 * ncol(train_df) + 5 + 1
nnet_model <- train(
Yield ~ ., data = train_df, method = "avNNet",
center = TRUE,
scale = TRUE,
tuneGrid = nnet_grid,
trControl = trainControl(method = "cv"),
linout = TRUE,
trace = FALSE,
MaxNWts = nnet_maxnwts,
maxit = 500
)
nnet_model
## Model Averaged Neural Network
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 131, 129, 130, 128, 130, 129, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.7910248 0.4228412 0.6130383
## 0.00 2 0.9218862 0.4080614 0.7428019
## 0.00 3 0.7740539 0.5185395 0.6172976
## 0.00 4 0.7024795 0.5648305 0.5597466
## 0.00 5 0.6390399 0.6194867 0.5144565
## 0.00 6 NaN NaN NaN
## 0.00 7 NaN NaN NaN
## 0.00 8 NaN NaN NaN
## 0.00 9 NaN NaN NaN
## 0.00 10 NaN NaN NaN
## 0.01 1 0.7696626 0.4978424 0.6083146
## 0.01 2 0.7584432 0.5120800 0.6254292
## 0.01 3 0.7995990 0.5296820 0.6455941
## 0.01 4 0.6883856 0.5995827 0.5517506
## 0.01 5 0.6843272 0.5975191 0.5467633
## 0.01 6 NaN NaN NaN
## 0.01 7 NaN NaN NaN
## 0.01 8 NaN NaN NaN
## 0.01 9 NaN NaN NaN
## 0.01 10 NaN NaN NaN
## 0.10 1 0.7121318 0.5409956 0.5882917
## 0.10 2 0.6924135 0.5545770 0.5497616
## 0.10 3 0.6739546 0.6037860 0.5412176
## 0.10 4 0.6559497 0.6294231 0.5376020
## 0.10 5 0.6453881 0.6413499 0.5201350
## 0.10 6 NaN NaN NaN
## 0.10 7 NaN NaN NaN
## 0.10 8 NaN NaN NaN
## 0.10 9 NaN NaN NaN
## 0.10 10 NaN NaN NaN
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5, decay = 0 and bag = FALSE.
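The NaN rows for sizes 6–10 are expected: a single-hidden-layer network with p = 56 predictors needs size * (p + 1) + size + 1 weights, which exceeds the MaxNWts cap of 291 once size > 5. A quick check:

p <- ncol(train_df) - 1  # 56 predictors
size <- 1:10
weights_needed <- size * (p + 1) + size + 1
data.frame(size, weights_needed, under_cap = weights_needed <= nnet_maxnwts)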
Based on the test-set metrics tabulated below, the partial least squares (PLS) model demonstrated the best overall performance among the models compared. It achieved the lowest root mean squared error (RMSE, 0.6193), indicating the smallest average prediction error, and the highest R-squared (0.6771), meaning it explained the most variance in the response. Its mean absolute error (MAE, 0.5060) was also among the lowest, further confirming its accuracy and consistency. The support vector machine (SVM) model also performed well, with a slightly lower MAE, but its higher RMSE and lower R-squared make it slightly less favorable than PLS. The MARS model showed the weakest test performance, with the highest RMSE and the lowest R-squared. Overall, the PLS model stands out as the most effective choice on these evaluation metrics.
library(kableExtra)
nnet_predictions <- predict(nnet_model, test_df)
results <- data.frame(t(postResample(pred = nnet_predictions, obs = test_df$Yield))) %>%
mutate("Model"= "Neural Network") %>% rbind(results)
results %>% unique() %>% kable() %>% kable_styling()
| RMSE      | Rsquared  | MAE       | Model          |
|-----------|-----------|-----------|----------------|
| 0.7197787 | 0.5648347 | 0.5760858 | Neural Network |
| 0.6815833 | 0.6251941 | 0.4926105 | SVM            |
| 0.7694337 | 0.5378346 | 0.5809592 | MARS           |
| 0.7149819 | 0.5894607 | 0.5047973 | KNN            |
| 0.6192577 | 0.6771122 | 0.5059984 | PLS            |
The SVM model relies most heavily on ManufacturingProcess32, but a combination of process- and material-related features drives its predictive performance. This type of output is useful for feature selection, interpretability, and guiding domain-specific investigations into critical quality factors.
The consistently high ranking of ManufacturingProcess32, 13, 09, and 36 suggests they are key drivers of the target variable. The presence of BiologicalMaterial features indicates that the interaction between material properties and process controls is critical in modeling performance or quality outcomes for the PLS model as well.
varImp(SVM_model, 10)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 93.82
## ManufacturingProcess09 89.93
## ManufacturingProcess17 88.20
## BiologicalMaterial06 82.61
## BiologicalMaterial03 79.44
## ManufacturingProcess36 73.85
## BiologicalMaterial12 72.36
## ManufacturingProcess06 69.00
## ManufacturingProcess11 62.34
## ManufacturingProcess31 56.39
## BiologicalMaterial02 50.34
## BiologicalMaterial11 48.53
## BiologicalMaterial09 44.76
## ManufacturingProcess30 41.87
## BiologicalMaterial08 40.24
## ManufacturingProcess29 38.54
## ManufacturingProcess33 38.16
## BiologicalMaterial04 36.92
## ManufacturingProcess25 36.83
varImp(pls_model, 10)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 88.04
## ManufacturingProcess36 82.20
## ManufacturingProcess13 82.11
## ManufacturingProcess17 80.25
## ManufacturingProcess06 59.06
## ManufacturingProcess11 55.93
## BiologicalMaterial02 55.46
## BiologicalMaterial06 54.64
## BiologicalMaterial03 54.50
## ManufacturingProcess33 53.91
## ManufacturingProcess12 52.04
## BiologicalMaterial08 49.76
## BiologicalMaterial12 47.40
## ManufacturingProcess34 45.47
## BiologicalMaterial11 45.05
## BiologicalMaterial01 44.18
## BiologicalMaterial04 42.95
## ManufacturingProcess04 39.94
## ManufacturingProcess28 36.61
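The same rankings can be displayed graphically with caret's plot method for varImp objects (plots not shown):

plot(varImp(SVM_model), top = 10)
plot(varImp(pls_model), top = 10)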
c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
predictors_ <- varImp(SVM_model)$importance |>
arrange(desc(Overall)) |>
head(10)
SVM_predictions <- predict(SVM_model, test_df)
# Note: Yield is replaced by the SVM predictions here, so the plots show the
# model's fitted relationships rather than the raw response
relationship_df <- test_df |> mutate(Yield = SVM_predictions)
for (predictor in rownames(predictors_)) {
# aes_string() is deprecated; use tidy evaluation via .data[[ ]]
plot <- ggplot(relationship_df, aes(x = .data[[predictor]], y = Yield)) +
geom_point(color = "blue", alpha = 0.6) +
labs(
title = paste("Relationship between", predictor, "and Yield"),
x = predictor,
y = "Yield"
) +
theme_minimal()
print(plot)
}
Yes, these scatter plots reveal meaningful intuition about the relationship between key biological and process variables and the yield. Several features, particularly ManufacturingProcess32, BiologicalMaterial06, and ManufacturingProcess13, show clear linear or nonlinear trends with yield. For example, ManufacturingProcess32 has a strong positive, roughly linear relationship with yield, which aligns with its consistently high importance score across the models. Similarly, BiologicalMaterial06 and BiologicalMaterial03 show upward trends, suggesting that yield rises with these biological material attributes. By contrast, variables like ManufacturingProcess36 and ManufacturingProcess06 show more scattered relationships, indicating weaker or more complex interactions with the target. Overall, the plots support the importance rankings generated by the models and highlight that certain process and biological factors are strongly associated with yield, offering both predictive power and domain insight.
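As a simple numeric complement to the plots, the marginal correlation of each top predictor with the observed yield in the training data (a sketch; the signs and magnitudes should mirror the trends described above):

sapply(rownames(predictors_), function(p) cor(train_df[[p]], train_df$Yield))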