Friedman (1991) introduced several benchmark data sets created by
simulation.
One of these simulations used the following nonlinear equation:
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)$$
Tune several models on these data.
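To make the equation concrete, here is a minimal base-R sketch of the simulation. It assumes ten independent Uniform(0, 1) predictors, of which only the first five enter the response; the helper name `sim_friedman1` is hypothetical and is not part of mlbench.

```r
# Hypothetical helper: simulate the Friedman benchmark in base R.
# Assumption: ten independent Uniform(0, 1) predictors; X6-X10 are pure noise.
sim_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), nrow = n, ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  # Deterministic part of the response, term by term from the equation
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add Gaussian noise with standard deviation sd
  list(x = as.data.frame(x), y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(1)
sim <- sim_friedman1(200, sd = 1)
dim(sim$x)
length(sim$y)
```

The sketch returns the same list shape (`x`, `y`) that the code below builds from `mlbench.friedman1`, but its random draws will not match the mlbench output.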
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
featurePlot(trainingData$x, trainingData$y)
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# KNN
set.seed(200)
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv")
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.238598 0.5836232 2.705822
## 7 3.117335 0.6295372 2.561052
## 9 3.100423 0.6590940 2.524483
## 11 3.086639 0.6822198 2.506584
## 13 3.094904 0.6902613 2.504433
## 15 3.116059 0.7045172 2.516131
## 17 3.129874 0.7133067 2.529370
## 19 3.151840 0.7183283 2.546422
## 21 3.175787 0.7209301 2.574113
## 23 3.208213 0.7146199 2.611285
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 11.
# MARS
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv")
)
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.188280 0.3042527 3.460689
## 1 3 3.551182 0.4999832 2.837116
## 1 4 2.653143 0.7167280 2.128222
## 1 5 2.405769 0.7562160 1.948161
## 1 6 2.295006 0.7754603 1.853199
## 1 7 1.771950 0.8611767 1.391357
## 1 8 1.647182 0.8774867 1.299564
## 1 9 1.609816 0.8837307 1.299705
## 1 10 1.635035 0.8798236 1.309436
## 1 11 1.571915 0.8896147 1.260711
## 1 12 1.571561 0.8898750 1.253077
## 1 13 1.567577 0.8906927 1.250795
## 1 14 1.571673 0.8909652 1.245508
## 1 15 1.571673 0.8909652 1.245508
## 1 16 1.571673 0.8909652 1.245508
## 1 17 1.571673 0.8909652 1.245508
## 1 18 1.571673 0.8909652 1.245508
## 1 19 1.571673 0.8909652 1.245508
## 1 20 1.571673 0.8909652 1.245508
## 1 21 1.571673 0.8909652 1.245508
## 1 22 1.571673 0.8909652 1.245508
## 1 23 1.571673 0.8909652 1.245508
## 1 24 1.571673 0.8909652 1.245508
## 1 25 1.571673 0.8909652 1.245508
## 1 26 1.571673 0.8909652 1.245508
## 1 27 1.571673 0.8909652 1.245508
## 1 28 1.571673 0.8909652 1.245508
## 1 29 1.571673 0.8909652 1.245508
## 1 30 1.571673 0.8909652 1.245508
## 1 31 1.571673 0.8909652 1.245508
## 1 32 1.571673 0.8909652 1.245508
## 1 33 1.571673 0.8909652 1.245508
## 1 34 1.571673 0.8909652 1.245508
## 1 35 1.571673 0.8909652 1.245508
## 1 36 1.571673 0.8909652 1.245508
## 1 37 1.571673 0.8909652 1.245508
## 1 38 1.571673 0.8909652 1.245508
## 2 2 4.188280 0.3042527 3.460689
## 2 3 3.551182 0.4999832 2.837116
## 2 4 2.615256 0.7216809 2.128763
## 2 5 2.344223 0.7683855 1.890080
## 2 6 2.275048 0.7762472 1.807779
## 2 7 1.841464 0.8418935 1.457945
## 2 8 1.641647 0.8839822 1.288520
## 2 9 1.535119 0.9002991 1.214772
## 2 10 1.473254 0.9101555 1.158761
## 2 11 1.379476 0.9207735 1.080991
## 2 12 1.285380 0.9283193 1.033426
## 2 13 1.267261 0.9328905 1.014726
## 2 14 1.261797 0.9327541 1.009821
## 2 15 1.266663 0.9320714 1.005751
## 2 16 1.270858 0.9322465 1.009757
## 2 17 1.263778 0.9327687 1.007653
## 2 18 1.263778 0.9327687 1.007653
## 2 19 1.263778 0.9327687 1.007653
## 2 20 1.263778 0.9327687 1.007653
## 2 21 1.263778 0.9327687 1.007653
## 2 22 1.263778 0.9327687 1.007653
## 2 23 1.263778 0.9327687 1.007653
## 2 24 1.263778 0.9327687 1.007653
## 2 25 1.263778 0.9327687 1.007653
## 2 26 1.263778 0.9327687 1.007653
## 2 27 1.263778 0.9327687 1.007653
## 2 28 1.263778 0.9327687 1.007653
## 2 29 1.263778 0.9327687 1.007653
## 2 30 1.263778 0.9327687 1.007653
## 2 31 1.263778 0.9327687 1.007653
## 2 32 1.263778 0.9327687 1.007653
## 2 33 1.263778 0.9327687 1.007653
## 2 34 1.263778 0.9327687 1.007653
## 2 35 1.263778 0.9327687 1.007653
## 2 36 1.263778 0.9327687 1.007653
## 2 37 1.263778 0.9327687 1.007653
## 2 38 1.263778 0.9327687 1.007653
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
# SVM Radial
set.seed(200)
svmModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv")
)
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.525164 0.7810576 2.010680
## 0.50 2.270567 0.7944850 1.794902
## 1.00 2.099319 0.8155594 1.659342
## 2.00 2.005858 0.8302852 1.578799
## 4.00 1.934650 0.8435677 1.528373
## 8.00 1.915653 0.8475592 1.528614
## 16.00 1.923884 0.8463090 1.535976
## 32.00 1.923884 0.8463090 1.535976
## 64.00 1.923884 0.8463090 1.535976
## 128.00 1.923884 0.8463090 1.535976
## 256.00 1.923884 0.8463090 1.535976
## 512.00 1.923884 0.8463090 1.535976
## 1024.00 1.923884 0.8463090 1.535976
## 2048.00 1.923884 0.8463090 1.535976
##
## Tuning parameter 'sigma' was held constant at a value of 0.06299324
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06299324 and C = 8.
# Neural Network
set.seed(200)
nnetGrid <- expand.grid(
.decay = c(0, 0.01, 0.1),
.size = c(1:10),
.bag = FALSE
)
nnetModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGrid,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
maxit = 500,
trControl = trainControl(method = "cv")
)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetModel
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.399955 0.7641657 1.892591
## 0.00 2 2.422496 0.7597612 1.940053
## 0.00 3 2.048209 0.8173992 1.637855
## 0.00 4 1.942073 0.8365333 1.554821
## 0.00 5 2.269670 0.7944413 1.738737
## 0.00 6 3.145864 0.7121045 2.242803
## 0.00 7 4.255063 0.4944622 2.730696
## 0.00 8 5.087452 0.5248100 2.898824
## 0.00 9 4.852871 0.5247125 2.790581
## 0.00 10 3.634054 0.6323073 2.472412
## 0.01 1 2.380902 0.7641801 1.871310
## 0.01 2 2.456920 0.7487966 1.925584
## 0.01 3 2.152617 0.8037267 1.690709
## 0.01 4 1.926277 0.8453343 1.547265
## 0.01 5 2.143562 0.8074224 1.717004
## 0.01 6 2.140588 0.8081466 1.696456
## 0.01 7 2.379436 0.7716195 1.850761
## 0.01 8 2.344241 0.7796374 1.845464
## 0.01 9 2.287667 0.7808732 1.739273
## 0.01 10 2.306110 0.7895454 1.833795
## 0.10 1 2.392295 0.7614560 1.873844
## 0.10 2 2.437044 0.7557121 1.918838
## 0.10 3 2.136581 0.8043180 1.702667
## 0.10 4 2.009700 0.8245206 1.574397
## 0.10 5 2.015255 0.8346002 1.586707
## 0.10 6 2.030267 0.8295288 1.586079
## 0.10 7 2.129594 0.8106161 1.699485
## 0.10 8 2.205965 0.8012470 1.719758
## 0.10 9 2.230354 0.8005457 1.736719
## 0.10 10 2.371577 0.7699950 1.866457
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.01 and bag = FALSE.
# Compare via resampling
results72 <- resamples(list(
KNN = knnModel,
MARS = marsModel,
SVM = svmModel,
NeuralNet = nnetModel
))
summary(results72)
##
## Call:
## summary.resamples(object = results72)
##
## Models: KNN, MARS, SVM, NeuralNet
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 1.9925181 2.203288 2.3861486 2.506584 2.768301 3.319190 0
## MARS 0.7326193 0.813534 0.9653755 1.009821 1.209518 1.395006 0
## SVM 1.2043965 1.425654 1.5320210 1.528614 1.564950 1.878901 0
## NeuralNet 1.0007594 1.431442 1.5112167 1.547265 1.687422 1.987858 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 2.5824845 2.797964 3.039929 3.086639 3.292676 3.858210 0
## MARS 0.8900136 1.026360 1.253189 1.261797 1.430436 1.691489 0
## SVM 1.4519357 1.834071 1.918422 1.915653 2.026748 2.362976 0
## NeuralNet 1.2308634 1.791444 1.838269 1.926277 2.088252 2.670345 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 0.4810928 0.6450407 0.6663713 0.6822198 0.7565689 0.8154310 0
## MARS 0.8759205 0.9008865 0.9424540 0.9327541 0.9642843 0.9832637 0
## SVM 0.7329985 0.8369792 0.8625555 0.8475592 0.8685317 0.9254094 0
## NeuralNet 0.6430600 0.8050521 0.8833792 0.8453343 0.8963460 0.9313138 0
# Test set performance
postResample(pred = predict(knnModel, newdata = testData$x), obs = testData$y)
## RMSE Rsquared MAE
## 3.1222641 0.6690472 2.4963650
postResample(pred = predict(marsModel, newdata = testData$x), obs = testData$y)
## RMSE Rsquared MAE
## 1.1722635 0.9448890 0.9324923
postResample(pred = predict(svmModel, newdata = testData$x), obs = testData$y)
## RMSE Rsquared MAE
## 2.0541197 0.8290353 1.5586411
postResample(pred = predict(nnetModel, newdata = testData$x), obs = testData$y)
## RMSE Rsquared MAE
## 2.0603975 0.8320657 1.5289921
# Does MARS select informative predictors X1-X5?
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.24
## X2 48.74
## X5 15.53
## X3 0.00
The Friedman (1991) simulated dataset was used to compare four nonlinear regression models: K-Nearest Neighbors (KNN), MARS, Support Vector Machines (SVM), and Neural Networks. In the feature plot, predictors X1–X5 appeared to have stronger relationships with the response, while X6–X10 looked like noise and did not appear to contribute.
Based on the cross-validation results, MARS performed best. It had the lowest RMSE (1.2618), the highest R-squared (0.9328), and the lowest MAE (1.0098) of the four models. The selected MARS model used degree = 2 and nprune = 14. Allowing second-degree terms lets the model represent interactions such as the x1x2 product in the simulation equation, and the best additive (degree = 1) fit only reached an RMSE of 1.5676.
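The difference between degree = 1 and degree = 2 can be sketched with the hinge functions MARS builds its terms from. This is only an illustration: the knot values (0.5) and the toy vectors are made up, and the real earth model chooses its own knots.

```r
# Hinge function used as a MARS building block: max(0, x - knot).
# (The mirrored form max(0, knot - x) is used as well.)
hinge <- function(x, knot) pmax(0, x - knot)

# Toy predictor values (illustrative only)
x1 <- c(0.1, 0.4, 0.7, 1.0)
x2 <- c(0.2, 0.5, 0.6, 0.9)

# degree = 1: every term involves a single predictor
term_additive <- hinge(x1, 0.5)

# degree = 2: a term may be the product of two hinges, so it is
# nonzero only where both predictors are past their knots
term_interaction <- hinge(x1, 0.5) * hinge(x2, 0.5)

cbind(x1, x2, term_additive, term_interaction)
```

A product term like this is the kind of basis function that is only available when degree = 2.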
The test set results also confirmed that MARS was the strongest model. Its test RMSE was 1.1723 with an R-squared of 0.9449, which was much better than KNN (RMSE = 3.1223), SVM (RMSE = 2.0541), and Neural Networks (RMSE = 2.0604).
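For reference, the three numbers that postResample reports can be computed by hand. The sketch below uses made-up predictions and assumes R-squared is the squared correlation between predicted and observed values, which is my understanding of caret's default; `reg_metrics` is a hypothetical name.

```r
# Hand-rolled versions of the three regression metrics.
reg_metrics <- function(pred, obs) {
  c(
    RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # assumption: squared correlation
    MAE      = mean(abs(obs - pred))        # mean absolute error
  )
}

# Made-up observed and predicted values (illustrative only)
obs  <- c(10, 12, 15, 11, 18)
pred <- c(11, 11, 14, 12, 17)
m <- reg_metrics(pred, obs)
m
```

Because every toy prediction is off by exactly one unit, RMSE and MAE are both 1 here. On real predictions RMSE is at least as large as MAE, since squaring weights large errors more heavily.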
The variable importance results address whether MARS selected the informative predictors. The predictors listed were X1, X4, X2, X5, and X3, in decreasing order of importance. These are exactly the five variables in the Friedman equation, and none of the noise variables X6–X10 appear. X3 is listed with a scaled importance of 0.00, so although it is in the list, it ranks last among the five by this measure.
Overall, MARS gave the best performance, and its variable importance list contained only the true predictors driving the response.
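The 0.00 shown for X3 in the varImp output is easier to read once the scaling is taken into account. The sketch below assumes the default scaling is a min-max rescaling to the range 0 to 100, which is my understanding of `scale = TRUE`. The raw scores are made-up numbers, not values from the fitted model.

```r
# Hypothetical raw importance scores (made-up numbers)
raw_imp <- c(X1 = 8.0, X4 = 6.5, X2 = 5.0, X5 = 3.0, X3 = 2.0)

# Min-max rescaling to 0-100 (assumed to mirror the default scaling):
# the smallest raw score always maps to 0, the largest to 100.
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100
round(scaled_imp, 2)
```

Under this assumption, a scaled score of 0 means "lowest in the list", not necessarily "unused by the model".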
7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
x_raw <- ChemicalManufacturingProcess[, -1]
y_raw <- ChemicalManufacturingProcess[, 1]
# Impute missing values
imputer <- preProcess(x_raw, method = "knnImpute")
x_imputed <- predict(imputer, x_raw)
# Remove near-zero variance predictors
nzv <- nearZeroVar(x_imputed)
x_clean <- if (length(nzv) > 0) x_imputed[, -nzv] else x_imputed # guard: -integer(0) would drop every column
# Train/test split
set.seed(200)
trainIndex <- createDataPartition(y_raw, p = 0.8, list = FALSE)
x_train <- x_clean[trainIndex, ]
x_test <- x_clean[-trainIndex, ]
y_train <- y_raw[trainIndex]
y_test <- y_raw[-trainIndex]
# Neural Network
set.seed(200)
nnetGrid <- expand.grid(
.decay = c(0, 0.01, 0.1),
.size = c(1:10),
.bag = FALSE
)
nnet_model <- train(
x = x_train,
y = y_train,
method = "avNNet",
tuneGrid = nnetGrid,
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
maxit = 500,
trControl = trainControl(method = "cv")
)
nnet_model
## Model Averaged Neural Network
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.523363 0.3604747 1.234126
## 0.00 2 1.520574 0.3503214 1.209887
## 0.00 3 1.764840 0.3162132 1.440432
## 0.00 4 2.103388 0.2918259 1.658930
## 0.00 5 2.023664 0.3056474 1.504181
## 0.00 6 2.105432 0.2747294 1.667130
## 0.00 7 2.751783 0.2608817 2.096610
## 0.00 8 4.014804 0.1738079 3.004669
## 0.00 9 5.360347 0.2653903 3.734766
## 0.00 10 7.660111 0.1505424 5.132431
## 0.01 1 1.544400 0.3987394 1.214985
## 0.01 2 1.479339 0.4718284 1.207249
## 0.01 3 1.419732 0.5061298 1.161174
## 0.01 4 1.567317 0.4243090 1.270015
## 0.01 5 1.851476 0.3654024 1.376861
## 0.01 6 1.520343 0.4686866 1.180328
## 0.01 7 1.456154 0.4675767 1.201409
## 0.01 8 1.709202 0.4808853 1.367491
## 0.01 9 2.310405 0.3996200 1.717039
## 0.01 10 2.385714 0.2537867 1.775583
## 0.10 1 1.469752 0.4537444 1.177949
## 0.10 2 1.655718 0.4423940 1.273301
## 0.10 3 1.592602 0.4914365 1.251160
## 0.10 4 1.881032 0.3930018 1.336763
## 0.10 5 1.724776 0.4863146 1.266150
## 0.10 6 1.729358 0.4206057 1.264146
## 0.10 7 1.684171 0.4585358 1.252075
## 0.10 8 1.912418 0.3658850 1.377345
## 0.10 9 1.619666 0.4384281 1.219826
## 0.10 10 1.512825 0.4127407 1.239338
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3, decay = 0.01 and bag = FALSE.
# MARS
set.seed(200)
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars_model <- train(
x = x_train,
y = y_train,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv")
)
mars_model
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.358846 0.4736659 1.0773044
## 1 3 1.221749 0.5615016 0.9872372
## 1 4 1.237128 0.5553756 0.9997244
## 1 5 1.266570 0.5295067 1.0275782
## 1 6 1.284562 0.5203949 1.0342975
## 1 7 1.273451 0.5390375 1.0041517
## 1 8 1.255157 0.5477567 0.9723961
## 1 9 1.260950 0.5322269 0.9852424
## 1 10 1.232531 0.5560667 0.9876363
## 1 11 1.238162 0.5487083 0.9746232
## 1 12 1.245077 0.5463104 0.9696214
## 1 13 1.243699 0.5490848 0.9714423
## 1 14 1.269269 0.5379033 0.9808639
## 1 15 1.268386 0.5369130 0.9779810
## 1 16 1.261876 0.5402092 0.9747751
## 1 17 1.262785 0.5392306 0.9771557
## 1 18 1.262785 0.5392306 0.9771557
## 1 19 1.262785 0.5392306 0.9771557
## 1 20 1.262785 0.5392306 0.9771557
## 1 21 1.262785 0.5392306 0.9771557
## 1 22 1.262785 0.5392306 0.9771557
## 1 23 1.262785 0.5392306 0.9771557
## 1 24 1.262785 0.5392306 0.9771557
## 1 25 1.262785 0.5392306 0.9771557
## 1 26 1.262785 0.5392306 0.9771557
## 1 27 1.262785 0.5392306 0.9771557
## 1 28 1.262785 0.5392306 0.9771557
## 1 29 1.262785 0.5392306 0.9771557
## 1 30 1.262785 0.5392306 0.9771557
## 1 31 1.262785 0.5392306 0.9771557
## 1 32 1.262785 0.5392306 0.9771557
## 1 33 1.262785 0.5392306 0.9771557
## 1 34 1.262785 0.5392306 0.9771557
## 1 35 1.262785 0.5392306 0.9771557
## 1 36 1.262785 0.5392306 0.9771557
## 1 37 1.262785 0.5392306 0.9771557
## 1 38 1.262785 0.5392306 0.9771557
## 2 2 1.358846 0.4736659 1.0773044
## 2 3 1.237063 0.5681071 1.0074004
## 2 4 1.222109 0.5758947 0.9771310
## 2 5 1.245134 0.5552182 0.9875915
## 2 6 1.364939 0.4650395 1.0745256
## 2 7 1.398639 0.4538580 1.0940460
## 2 8 1.407908 0.4489776 1.0994214
## 2 9 1.405588 0.4509498 1.0950718
## 2 10 1.369356 0.4729875 1.0481172
## 2 11 1.397721 0.4583520 1.0796357
## 2 12 1.378682 0.4749164 1.0511213
## 2 13 1.359249 0.4821028 1.0296717
## 2 14 1.345667 0.4941844 1.0238402
## 2 15 1.329379 0.5129324 1.0005878
## 2 16 1.358357 0.4971061 1.0225444
## 2 17 1.334597 0.5107436 0.9910424
## 2 18 1.314057 0.5179948 0.9768382
## 2 19 1.353755 0.5111639 1.0136810
## 2 20 1.322935 0.5269341 0.9912407
## 2 21 1.328934 0.5278357 1.0050881
## 2 22 1.332035 0.5270839 1.0045313
## 2 23 1.335219 0.5254110 1.0136223
## 2 24 1.332371 0.5267988 1.0114858
## 2 25 1.327000 0.5320395 1.0073512
## 2 26 1.327171 0.5343851 1.0107119
## 2 27 1.327171 0.5343851 1.0107119
## 2 28 1.327171 0.5343851 1.0107119
## 2 29 1.327171 0.5343851 1.0107119
## 2 30 1.327171 0.5343851 1.0107119
## 2 31 1.327171 0.5343851 1.0107119
## 2 32 1.327171 0.5343851 1.0107119
## 2 33 1.327171 0.5343851 1.0107119
## 2 34 1.327171 0.5343851 1.0107119
## 2 35 1.327171 0.5343851 1.0107119
## 2 36 1.327171 0.5343851 1.0107119
## 2 37 1.327171 0.5343851 1.0107119
## 2 38 1.327171 0.5343851 1.0107119
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
# SVM Radial
set.seed(200)
svm_model <- train(
x = x_train,
y = y_train,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv")
)
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.363690 0.4910338 1.1102265
## 0.50 1.264912 0.5396158 1.0224011
## 1.00 1.169609 0.6060516 0.9313862
## 2.00 1.132682 0.6373329 0.8991576
## 4.00 1.129125 0.6401732 0.9010163
## 8.00 1.128698 0.6383315 0.8946774
## 16.00 1.123462 0.6414746 0.8898824
## 32.00 1.123462 0.6414746 0.8898824
## 64.00 1.123462 0.6414746 0.8898824
## 128.00 1.123462 0.6414746 0.8898824
## 256.00 1.123462 0.6414746 0.8898824
## 512.00 1.123462 0.6414746 0.8898824
## 1024.00 1.123462 0.6414746 0.8898824
## 2048.00 1.123462 0.6414746 0.8898824
##
## Tuning parameter 'sigma' was held constant at a value of 0.01364473
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01364473 and C = 16.
# KNN
set.seed(200)
knn_model <- train(
x = x_train,
y = y_train,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv")
)
knn_model
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 129, 129, 130, 131, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.211973 0.5736136 0.9610462
## 7 1.251504 0.5430299 0.9936143
## 9 1.280785 0.5216002 1.0177954
## 11 1.304234 0.5067294 1.0444028
## 13 1.341687 0.4797393 1.0793004
## 15 1.368752 0.4557780 1.0925164
## 17 1.360050 0.4613685 1.0910232
## 19 1.372151 0.4589683 1.1076356
## 21 1.392773 0.4451492 1.1284405
## 23 1.384681 0.4581734 1.1157607
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Compare all models via resampling and test set performance
results <- resamples(list(
NeuralNet = nnet_model,
MARS = mars_model,
SVM = svm_model,
KNN = knn_model
))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: NeuralNet, MARS, SVM, KNN
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NeuralNet 0.9020430 0.9305145 1.0824023 1.1611743 1.2390664 2.032501 0
## MARS 0.8272437 0.8876048 0.9199464 0.9872372 1.0003701 1.380789 0
## SVM 0.6441756 0.7438694 0.8932662 0.8898824 0.9980542 1.197383 0
## KNN 0.6154667 0.8431071 0.9252051 0.9610462 1.0807872 1.445846 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NeuralNet 1.0116140 1.1301716 1.331673 1.419732 1.503636 2.554810 0
## MARS 1.0359253 1.0810183 1.148108 1.221749 1.236913 1.838730 0
## SVM 0.8062012 0.8782177 1.067869 1.123462 1.268242 1.884572 0
## KNN 0.8854909 1.0221954 1.142363 1.211973 1.293895 1.988005 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NeuralNet 0.06229665 0.4146912 0.5461120 0.5061298 0.6234512 0.7291930 0
## MARS 0.37619797 0.4779541 0.5645000 0.5615016 0.6573148 0.6903550 0
## SVM 0.32326221 0.5300366 0.6847175 0.6414746 0.7831806 0.8035011 0
## KNN 0.27046464 0.4533846 0.6180339 0.5736136 0.6796181 0.7832414 0
postResample(predict(nnet_model, x_test), y_test)
## RMSE Rsquared MAE
## 2.380885 0.205016 1.610862
postResample(predict(mars_model, x_test), y_test)
## RMSE Rsquared MAE
## 1.4045894 0.5224818 1.0821086
postResample(predict(svm_model, x_test), y_test)
## RMSE Rsquared MAE
## 1.3289206 0.6243324 0.9940246
postResample(predict(knn_model, x_test), y_test)
## RMSE Rsquared MAE
## 1.5958081 0.4470237 1.2768125
Based on both the cross-validation results and the test set performance, the SVM Radial model performed the best overall.
From the resampling results, SVM had the lowest average RMSE (1.1235), the highest average R-squared (0.6415), and the lowest average MAE (0.8899) compared to Neural Networks, MARS, and KNN. This shows that SVM gave the best average performance during cross-validation, although its fold-to-fold RMSE range (0.81 to 1.88) was about as wide as that of the other models.
The test set results also confirmed that SVM was the strongest model. Its test RMSE was 1.3289 with an R-squared of 0.6243 and MAE of 0.9940, which was better than MARS (RMSE = 1.4046), KNN (RMSE = 1.5958), and the neural network (RMSE = 2.3809).
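The resampled averages in the summary come from fitting each model on nine folds and scoring it on the held-out tenth. A minimal base-R sketch of that procedure is below. It uses simulated data and a plain linear model rather than the chemical data, and it assigns folds at random, whereas caret's fold creation may balance folds on the outcome.

```r
set.seed(200)
n <- 120
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.5)  # simulated data (illustrative only)

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# For each fold: fit on the other nine folds, score on the held-out fold
fold_rmse <- vapply(seq_len(k), function(f) {
  fit  <- lm(y ~ x, data = dat[fold_id != f, ])
  pred <- predict(fit, newdata = dat[fold_id == f, ])
  sqrt(mean((dat$y[fold_id == f] - pred)^2))
}, numeric(1))

mean(fold_rmse)   # the average reported for each model
range(fold_rmse)  # the fold-to-fold spread (Min. and Max. in the summary)
```

The mean is what the tuning tables report, while the range is what one would look at to judge how consistent a model is across folds.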
Even though MARS performed well, SVM had better overall test performance and handled the nonlinear relationships in the chemical manufacturing process data more effectively.
Overall, the SVM Radial model was the best nonlinear regression model for this dataset.
# Variable importance from best model
test_rmse <- c(
NeuralNet = postResample(predict(nnet_model, x_test), y_test)[["RMSE"]],
MARS = postResample(predict(mars_model, x_test), y_test)[["RMSE"]],
SVM = postResample(predict(svm_model, x_test), y_test)[["RMSE"]],
KNN = postResample(predict(knn_model, x_test), y_test)[["RMSE"]]
)
best_name <- names(which.min(test_rmse))
best_name
## [1] "SVM"
model_list <- list(
NeuralNet = nnet_model,
MARS = mars_model,
SVM = svm_model,
KNN = knn_model
)
best_model <- model_list[[best_name]]
varImp(best_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 87.81
## ManufacturingProcess13 78.23
## BiologicalMaterial03 76.45
## BiologicalMaterial12 69.16
## ManufacturingProcess17 68.44
## ManufacturingProcess31 67.32
## ManufacturingProcess36 67.18
## ManufacturingProcess09 64.12
## ManufacturingProcess06 60.80
## BiologicalMaterial02 53.50
## ManufacturingProcess29 53.10
## BiologicalMaterial11 48.66
## ManufacturingProcess11 48.25
## ManufacturingProcess33 46.61
## ManufacturingProcess30 45.38
## BiologicalMaterial09 38.34
## BiologicalMaterial04 37.89
## BiologicalMaterial08 36.89
## ManufacturingProcess12 36.65
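The most important predictors reported for the best nonlinear model (SVM Radial) were a mix of manufacturing process variables and biological material variables. As the output header indicates, the ranking is a loess R-squared computed for each predictor against yield, not a measure derived from the fitted SVM itself.
The top ten predictors were ManufacturingProcess32, BiologicalMaterial06, ManufacturingProcess13, BiologicalMaterial03, BiologicalMaterial12, ManufacturingProcess17, ManufacturingProcess31, ManufacturingProcess36, ManufacturingProcess09, and ManufacturingProcess06.
From these results, manufacturing process variables dominate the list, since seven of the top ten predictors come from the ManufacturingProcess group. However, three biological variables (BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial12) rank in the top five, showing that both groups contribute to predicting yield.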
Compared to the optimal linear model from Exercise 6.3, many of the same predictors still appear important, especially ManufacturingProcess variables like 32, 17, and 13. This suggests that these variables have strong influence on yield regardless of whether a linear or nonlinear model is used.
Overall, the process variables were slightly more dominant, but both biological and process predictors were important in explaining the final product yield.
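The "loess r-squared" label in the varImp output refers to a filter-style measure that scores each predictor on its own. The sketch below shows roughly what such a measure computes, on simulated data. The helper `loess_r2` is hypothetical, and the exact formula caret uses may differ in detail.

```r
# Hypothetical helper: pseudo R-squared of a loess smooth of y on one predictor.
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(200)
n <- 150
toy <- data.frame(
  signal = runif(n),  # nonlinearly related to the response
  noise  = runif(n)   # unrelated to the response
)
yield_sim <- sin(2 * pi * toy$signal) + rnorm(n, sd = 0.2)  # simulated response

# Score each predictor separately, ignoring all other predictors
scores <- vapply(toy, loess_r2, numeric(1), y = yield_sim)
scores
```

Because each predictor is scored in isolation, a ranking of this kind would be the same for any model that lacks its own importance method, and it cannot reflect interactions between predictors.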
# top 10 predictors vs yield
imp <- varImp(best_model)$importance
imp$Predictor <- rownames(imp)
imp$Overall <- imp[, 1]
imp <- imp[order(-imp$Overall), ]
top_names <- imp$Predictor[1:10]
par(mfrow = c(2, 5))
for (v in top_names) {
  plot(
    x_train[, v],
    y_train,
    xlab = v,
    ylab = "Yield",
    main = v
  )
  abline(lm(y_train ~ x_train[, v]))
}
The plots show that several of the top predictors have clear relationships with yield, especially ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess09.
Some predictors show a positive relationship with yield, such as ManufacturingProcess32, BiologicalMaterial06, and ManufacturingProcess09, where higher values are generally associated with higher yield. Other predictors, such as ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess31, show a negative relationship, where larger values are associated with lower yield.
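The direction of each relationship can also be checked numerically rather than read off the plots. The sketch below uses simulated predictors (`up`, `down`, `flat` are made-up names), not the chemical data, and reports the Pearson correlation and the least-squares slope for each one.

```r
set.seed(200)
n <- 100
toy <- data.frame(
  up   = rnorm(n),  # constructed to increase the response
  down = rnorm(n),  # constructed to decrease the response
  flat = rnorm(n)   # constructed to have no effect
)
yield_sim <- 40 + 1.5 * toy$up - 1.0 * toy$down + rnorm(n, sd = 1)

# Pearson correlation and simple-regression slope for each predictor
direction <- t(vapply(toy, function(x, y) {
  c(cor = cor(x, y), slope = unname(coef(lm(y ~ x))[2]))
}, c(cor = 0, slope = 0), y = yield_sim))
direction
```

The sign of the correlation always matches the sign of the fitted line drawn by `abline(lm(...))`, so a table like this is a compact companion to the scatterplots.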
A few predictors, like ManufacturingProcess36, show weaker or less consistent patterns, which suggests that their effect may be more complex or may depend on interactions with other variables rather than a simple straight-line relationship.
These plots suggest that manufacturing process variables are, on the whole, more strongly and directly associated with yield than biological material variables. This is plausible because process settings often directly control production efficiency and final output quality, although scatterplots alone cannot establish a causal effect. Biological variables still matter, but their relationships with yield appear less dominant than those of the process variables.
Overall, the plots support the idea that both biological and process predictors affect yield, but manufacturing process variables seem to have the strongest influence in the nonlinear model.