Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations uses the following nonlinear equation:
\[ y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]
The x values are random variables uniformly distributed on \([0,1]\). There are also 5 additional non-informative variables.
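To make the equation concrete, the simulation can be written out in base R. This is a minimal sketch, not the mlbench source: `simulate_friedman1` is a hypothetical helper name, and its random draws are not guaranteed to match those of `mlbench.friedman1` for the same seed.

```r
# Hand-rolled version of the benchmark equation above (base R only).
# simulate_friedman1 is a hypothetical helper, not part of mlbench.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten predictors, each uniform on [0, 1]; only X1-X5 enter the equation
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Additive Gaussian noise with standard deviation sd
  y <- signal + rnorm(n, mean = 0, sd = sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)

# Marginal correlation of each predictor with the response
round(cor(sim$x, sim$y), 2)
```

Because X6 to X10 never appear in `signal`, any association they show with `y` is sampling noise. Marginal correlations can still understate X1 to X3, since those predictors enter through an interaction and a quadratic term.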
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
# Convert predictors to data frame
trainingData$x <- data.frame(trainingData$x)
# Explore relationships
featurePlot(trainingData$x, trainingData$y)
#Generate test data
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
#Model KNN
library(caret)
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10
)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
#Prediction and Performance
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
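KNN performs moderately on this data set (test RMSE ≈ 3.20, R² ≈ 0.68), but it is unlikely to be the best choice. The response is nonlinear in the predictors, and five of the ten predictors are noise. Because KNN weights every predictor equally when computing distances, the irrelevant variables dilute its accuracy. Models such as random forests, boosting, and MARS usually perform better because they capture nonlinear patterns and are less affected by unimportant predictors. MARS in particular performs feature selection as it builds its basis functions, so it is expected to retain the informative predictors (X1–X5) and largely ignore the noise variables (X6–X10). No MARS fit appears in the output above, so this is an expectation rather than a demonstrated result.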
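One way to check the MARS question directly is to fit the model through caret and inspect its variable importance. The sketch below assumes the earth package is installed. The tuning grid and the 10-fold cross-validation are illustrative choices, not settings taken from the analysis above.

```r
library(mlbench)
library(caret)
library(earth)  # assumption: earth is installed; caret's "earth" method uses it

# Recreate the training data with the same seed as above
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

# Illustrative grid: additive (degree 1) and two-way interaction (degree 2) models
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)

set.seed(200)
marsModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  tuneGrid = marsGrid,
  trControl = trainControl(method = "cv", number = 10)
)

# Tuning parameters chosen by resampling
marsModel$bestTune

# Importance scores; predictors that MARS did not use should score zero
marsImp <- varImp(marsModel)
marsImp
```

If X6 to X10 all score zero, MARS has selected only the informative predictors. Test set performance can then be compared with KNN by calling `postResample` on predictions for `testData$x`.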
7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Which nonlinear regression model gives the optimal resampling and test set performance?
The random forest (RF) model performed the best overall. In resampling it has the lowest mean MAE (~0.82) and a mean RMSE (~1.10) close to the lowest, and on the test set it has the lowest RMSE (~1.06), the lowest MAE (~0.90), and the highest R² (~0.72). Its resampled R² (~0.69) shows that it explains a large portion of the variation in yield.
The gradient boosting model (GBM) is a close second. It has a slightly lower mean resampled RMSE (~1.07) and a slightly higher mean R² (~0.70) than random forest, but its worst fold is worse (maximum RMSE ~1.60 versus ~1.37) and its test set performance is weaker (RMSE ~1.15, R² ~0.64).
The support vector machine (SVM) shows moderate performance. Its errors are higher (RMSE ~1.27, MAE ~1.03), and its R² (~0.60) is lower, meaning it does not explain the data as well as RF or GBM.
The k-nearest neighbors (KNN) model performs the worst. It has the highest RMSE (~1.56) and lowest R² (~0.33), indicating poor predictive accuracy and weak ability to capture the underlying patterns.
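For reference, the three metrics compared above can be computed by hand. The sketch below uses base R only, and the observed and predicted values are made-up numbers for illustration, not yields from the data set. The R² here is the squared correlation between predictions and observations, which is how caret's `postResample` reports it.

```r
# Base R versions of the three metrics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))  # penalises large errors more
mae  <- function(pred, obs) mean(abs(pred - obs))       # average absolute error
rsq  <- function(pred, obs) cor(pred, obs)^2            # squared correlation

# Hypothetical values for illustration only
obs  <- c(38.0, 40.5, 41.2, 39.7, 42.1)
pred <- c(38.6, 39.9, 41.0, 40.5, 41.4)

c(RMSE = rmse(pred, obs), Rsquared = rsq(pred, obs), MAE = mae(pred, obs))
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, which is one reason the two metrics can rank models differently.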
library(caret)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Warning: package 'gbm' was built under R version 4.5.3
## Loaded gbm 2.2.3
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.3
data(ChemicalManufacturingProcess)
set.seed(123)
data <- na.omit(ChemicalManufacturingProcess)
trainIndex <- createDataPartition(data$Yield, p = 0.8, list = FALSE)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
ctrl <- trainControl(method = "cv", number = 10)
rf <- train(Yield ~ ., data = train, method = "rf", trControl = ctrl)
gbm <- train(Yield ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)
svm <- train(Yield ~ ., data = train, method = "svmRadial", trControl = ctrl)
knn <- train(Yield ~ ., data = train, method = "knn", trControl = ctrl)
summary(resamples(list(RF = rf, GBM = gbm, SVM = svm, KNN = knn)))
##
## Call:
## summary.resamples(object = resamples(list(RF = rf, GBM = gbm, SVM = svm, KNN
## = knn)))
##
## Models: RF, GBM, SVM, KNN
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.3830538 0.6319365 0.8483928 0.8187291 1.0268606 1.145548 0
## GBM 0.5377941 0.7052098 0.8410083 0.8380573 0.9012233 1.256273 0
## SVM 0.6004475 0.9335970 1.0065137 1.0264916 1.2184641 1.338004 0
## KNN 0.9923333 1.2016667 1.2435000 1.2844390 1.3959083 1.721000 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4882926 0.9848000 1.169438 1.097606 1.325000 1.365387 0
## GBM 0.6678843 0.8494942 1.122050 1.073312 1.235495 1.595767 0
## SVM 0.6548414 1.1794567 1.225895 1.265852 1.554666 1.646018 0
## KNN 1.1244237 1.3705753 1.612358 1.562215 1.781860 2.006747 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.35414121 0.5905339 0.6968129 0.6878912 0.8031957 0.9196428 0
## GBM 0.40394079 0.6384142 0.7536110 0.7039434 0.8350270 0.8823697 0
## SVM 0.30116886 0.4680684 0.6331726 0.5972150 0.7119231 0.8688396 0
## KNN 0.03427829 0.1729105 0.2684638 0.3253176 0.4835366 0.7358602 0
postResample(predict(rf, test), test$Yield)
## RMSE Rsquared MAE
## 1.0599536 0.7227313 0.8999873
postResample(predict(gbm, test), test$Yield)
## RMSE Rsquared MAE
## 1.1527547 0.6419587 0.9580847
postResample(predict(svm, test), test$Yield)
## RMSE Rsquared MAE
## 1.1472997 0.6776595 0.9196813
postResample(predict(knn, test), test$Yield)
## RMSE Rsquared MAE
## 1.5217635 0.3923186 1.1051429
Which predictors are most important? Do the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
The variable importance results from the random forest model show that the highest-ranked predictors are manufacturing process variables: ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess31, and ManufacturingProcess17 take the top four positions, and ManufacturingProcess32 is far ahead of the rest (scaled importance 100 versus 40.9 for the runner-up). The top ten is split evenly between five process and five biological variables, but the biological variables (e.g., BiologicalMaterial03 and BiologicalMaterial06) sit lower in the ranking. This indicates that process conditions play a more dominant role in determining yield than the biological inputs.
Compared to the optimal linear model, there is some overlap in the important predictors, but the random forest places greater emphasis on process variables. This difference suggests that the nonlinear model is better able to capture complex relationships and interactions among process variables that are not fully represented in the linear model.
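The comparison between two top-ten lists can be made explicit with set operations. In the sketch below, `rf_top10` is copied from the random forest ranking in this section. `linear_top` is a hypothetical placeholder, not a fitted result, and should be replaced with the names taken from the linear model's variable importance. `compare_top` is an illustrative helper.

```r
# Random forest top ten, in rank order
rf_top10 <- c(
  "ManufacturingProcess32", "ManufacturingProcess13", "ManufacturingProcess31",
  "ManufacturingProcess17", "BiologicalMaterial03", "BiologicalMaterial06",
  "ManufacturingProcess09", "BiologicalMaterial11", "BiologicalMaterial12",
  "BiologicalMaterial05"
)

# Count predictors by type by stripping the trailing digits from each name
type_counts <- table(sub("[0-9]+$", "", rf_top10))
type_counts

# Hypothetical helper: which names are shared, and which are unique to each list
compare_top <- function(a, b) {
  list(shared = intersect(a, b), only_a = setdiff(a, b), only_b = setdiff(b, a))
}

# HYPOTHETICAL placeholder list, not output from any fitted linear model
linear_top <- c("ManufacturingProcess32", "ManufacturingProcess09", "BiologicalMaterial02")

compare_top(rf_top10, linear_top)
```

The `only_a` element gives the predictors unique to the nonlinear model, which is the set the next question asks about.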
# Variable importance from RF
imp <- varImp(rf)
# View top variables
imp
## rf variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 40.910
## ManufacturingProcess31 22.111
## ManufacturingProcess17 19.409
## BiologicalMaterial03 19.085
## BiologicalMaterial06 15.855
## ManufacturingProcess09 15.188
## BiologicalMaterial11 12.115
## BiologicalMaterial12 9.090
## BiologicalMaterial05 8.268
## BiologicalMaterial04 7.665
## ManufacturingProcess36 6.915
## ManufacturingProcess25 5.789
## ManufacturingProcess15 5.533
## ManufacturingProcess24 5.371
## ManufacturingProcess01 5.071
## ManufacturingProcess30 4.958
## ManufacturingProcess06 4.575
## BiologicalMaterial02 4.496
## BiologicalMaterial01 4.329
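# Top 10 predictors (drop = FALSE keeps the data frame, so the row names are not lost)
top10 <- head(imp$importance[order(imp$importance$Overall, decreasing = TRUE), , drop = FALSE], 10)
top10
## Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13 40.909840
## ManufacturingProcess31 22.111140
## ManufacturingProcess17 19.409362
## BiologicalMaterial03 19.084825
## BiologicalMaterial06 15.854998
## ManufacturingProcess09 15.187843
## BiologicalMaterial11 12.115352
## BiologicalMaterial12 9.089606
## BiologicalMaterial05 8.267685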
# Plot importance
plot(imp, top = 10)
Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
# Example variables (adjust if needed)
vars <- c("ManufacturingProcess32",
"ManufacturingProcess13",
"ManufacturingProcess31")
# Scatterplots
featurePlot(x = train[, vars],
y = train$Yield,
plot = "scatter")
The scatterplots of the three highest-ranked random forest predictors show that the relationships between these manufacturing process variables and yield are complex and not strictly linear. For example, ManufacturingProcess32 displays a clear positive trend, where higher values are generally associated with higher yield, but the spread of the data suggests variability and possible nonlinear effects. ManufacturingProcess13 shows a weaker and more dispersed relationship, indicating that its impact on yield may depend on interactions with other variables rather than acting independently. ManufacturingProcess31 exhibits a clustered pattern, with most observations concentrated in a narrow range, suggesting a threshold or limited effect unless combined with other factors.
Overall, these plots reveal that process variables have more structured and complex relationships with yield, including potential nonlinearities and interactions. This provides intuition for why nonlinear models such as random forests perform better than linear models, as they are able to capture these more complicated patterns.
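Adding a smoother to each scatterplot makes any curvature easier to judge than raw points alone. The sketch below assumes the same three predictors. It uses all complete cases rather than only the training split, and the `span` value is an illustrative choice.

```r
library(caret)
library(AppliedPredictiveModeling)

data(ChemicalManufacturingProcess)
# Complete cases only, matching the na.omit() step used earlier
cmp <- na.omit(ChemicalManufacturingProcess)

vars <- c("ManufacturingProcess32",
          "ManufacturingProcess13",
          "ManufacturingProcess31")

# type = c("p", "smooth") overlays a loess curve on the points in each panel
smoothPlot <- featurePlot(x = cmp[, vars],
                          y = cmp$Yield,
                          plot = "scatter",
                          type = c("p", "smooth"),
                          span = 0.75,
                          layout = c(3, 1))
smoothPlot
```

A smoother that bends or flattens over part of the range points to a nonlinear effect, while a roughly straight one suggests that a linear term would capture most of that predictor's marginal relationship with yield.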