624 HW 8

7.2 Friedman Simulation Data

Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations uses the following nonlinear equation:

\[ y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2) \]

The x values are random variables uniformly distributed on \([0, 1]\). Only \(x_1\) through \(x_5\) enter the equation; the simulation also includes 5 additional non-informative variables.
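To make the formula concrete, here is a minimal base-R sketch of the same simulation. The function name `simulate_friedman1` is hypothetical and only for illustration; the analysis itself relies on `mlbench.friedman1()`.

```r
# Hypothetical helper: simulate the Friedman benchmark by hand.
simulate_friedman1 <- function(n, sd = 1) {
  # Ten independent Uniform(0, 1) predictors; only the first five enter y.
  x <- matrix(runif(n * 10), nrow = n, ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # Add N(0, sd^2) noise to the deterministic signal.
  y <- signal + rnorm(n, mean = 0, sd = sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(1)
sim <- simulate_friedman1(200, sd = 1)
dim(sim$x)      # 200 rows, 10 columns
summary(sim$y)
```

Columns X6 through X10 are generated but never used in `signal`, which is what makes them non-informative.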

Generate Training Data

library(mlbench)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)

# Convert predictors to data frame
trainingData$x <- data.frame(trainingData$x)

# Explore relationships
featurePlot(trainingData$x, trainingData$y)

# Generate test data

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# Fit a KNN model (caret is already loaded above)

knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
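As a rough illustration of what `method = "knn"` with centering and scaling does, the sketch below implements KNN regression in base R. The helper name `knn_predict` and the toy data are hypothetical, and this is a simplified version, not caret's actual implementation.

```r
# Hypothetical helper: KNN regression with centering and scaling.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  # Centering and scaling statistics come from the training set only.
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- scale(new_x, center = mu, scale = sigma)
  apply(new_z, 1, function(row) {
    # Euclidean distance from this point to every training point.
    d <- sqrt(colSums((t(train_z) - row)^2))
    # Prediction is the mean response of the k nearest neighbors.
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy check: the three closest points have responses 1, 2, and 3.
toy_x <- data.frame(a = c(0, 1, 2, 10), b = c(0, 1, 2, 10))
toy_y <- c(1, 2, 3, 100)
toy_pred <- knn_predict(toy_x, toy_y, data.frame(a = 1, b = 1), k = 3)
toy_pred  # 2
```

Because every predictor contributes equally to the distance, noise columns affect which neighbors are chosen just as much as informative ones.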

# Prediction and performance
knnPred <- predict(knnModel, newdata = testData$x)

postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
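For reference, the three metrics above can be computed by hand. The sketch below uses a hypothetical helper, `regression_metrics`, and made-up numbers; it assumes R² is reported as the squared correlation between predictions and observations, which is how caret documents its default.

```r
# Hypothetical helper mirroring the three metrics reported above.
regression_metrics <- function(pred, obs) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    Rsquared = cor(pred, obs)^2,   # squared Pearson correlation
    MAE = mean(abs(obs - pred))
  )
}

# Made-up values, for illustration only.
obs <- c(3, 5, 7, 9)
pred <- c(2, 5, 8, 9)
m <- regression_metrics(pred, obs)
m
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two points to a few large misses.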

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

KNN performs acceptably on this data set (test RMSE ≈ 3.20, R² ≈ 0.68), but it is unlikely to be the best choice. KNN weights all ten predictors equally when computing distances, so the five non-informative variables (X6–X10) dilute the signal from X1–X5. Models such as random forests, boosting, and MARS usually perform better on this benchmark because they capture nonlinear patterns and are less affected by unimportant predictors. MARS also performs feature selection as it builds its basis functions, so it is expected to focus on X1–X5 and mostly ignore the irrelevant ones (X6–X10).
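The write-up does not include a MARS fit, so one way to check which predictors MARS keeps is sketched below. It assumes the `earth` package is installed; `degree = 2` is an illustrative choice, not a tuned setting, and the object names are hypothetical.

```r
library(mlbench)
library(earth)

set.seed(200)
sim <- mlbench.friedman1(200, sd = 1)
simX <- data.frame(sim$x)   # columns are named X1 ... X10

# degree = 2 allows two-way interactions such as the x1 * x2 term;
# this value is an assumption, not a tuned setting.
marsFit <- earth(x = simX, y = sim$y, degree = 2)

# Predictors that MARS kept, ranked by importance.
evimp(marsFit)
```

Any predictor missing from the `evimp()` table was not used in the final MARS model, so X6–X10 should be absent or near the bottom if MARS is selecting the informative variables.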

7.5 Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Which nonlinear regression model gives the optimal resampling and test set performance?

The random forest (RF) model performed the best overall. In resampling it has the lowest mean MAE (~0.82) and a mean RMSE (~1.10) within about 0.03 of the best, with a high R² (~0.69). It also has the best test set results of the four models (RMSE ~1.06, R² ~0.72), so its performance holds up on data it has not seen.

The gradient boosting model (GBM) is a close second. It has the lowest mean resampled RMSE (~1.07) and the highest mean R² (~0.70), but its worst fold is poorer than RF's (maximum RMSE ~1.60 versus ~1.37) and it does not carry over as well to the test set (RMSE ~1.15, R² ~0.64).

The support vector machine (SVM) shows moderate performance. Its errors are higher (RMSE ~1.27, MAE ~1.03), and its R² (~0.60) is lower, meaning it does not explain the data as well as RF or GBM.

The k-nearest neighbors (KNN) model performs the worst. It has the highest RMSE (~1.56) and lowest R² (~0.33), indicating poor predictive accuracy and weak ability to capture the underlying patterns.

library(caret)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.5.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(gbm)

## Warning: package 'gbm' was built under R version 4.5.3

## Loaded gbm 2.2.3

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.3

data(ChemicalManufacturingProcess)

set.seed(123)

# Note: na.omit() drops every row with a missing value instead of imputing it
data <- na.omit(ChemicalManufacturingProcess)

trainIndex <- createDataPartition(data$Yield, p = 0.8, list = FALSE)
train <- data[trainIndex, ]
test  <- data[-trainIndex, ]

ctrl <- trainControl(method = "cv", number = 10)

rf  <- train(Yield ~ ., data = train, method = "rf", trControl = ctrl)
gbm <- train(Yield ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)
svm <- train(Yield ~ ., data = train, method = "svmRadial", trControl = ctrl)
knn <- train(Yield ~ ., data = train, method = "knn", trControl = ctrl)

summary(resamples(list(RF = rf, GBM = gbm, SVM = svm, KNN = knn)))

## 
## Call:
## summary.resamples(object = resamples(list(RF = rf, GBM = gbm, SVM = svm, KNN
##  = knn)))
## 
## Models: RF, GBM, SVM, KNN 
## Number of resamples: 10 
## 
## MAE 
##          Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## RF  0.3830538 0.6319365 0.8483928 0.8187291 1.0268606 1.145548    0
## GBM 0.5377941 0.7052098 0.8410083 0.8380573 0.9012233 1.256273    0
## SVM 0.6004475 0.9335970 1.0065137 1.0264916 1.2184641 1.338004    0
## KNN 0.9923333 1.2016667 1.2435000 1.2844390 1.3959083 1.721000    0
## 
## RMSE 
##          Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## RF  0.4882926 0.9848000 1.169438 1.097606 1.325000 1.365387    0
## GBM 0.6678843 0.8494942 1.122050 1.073312 1.235495 1.595767    0
## SVM 0.6548414 1.1794567 1.225895 1.265852 1.554666 1.646018    0
## KNN 1.1244237 1.3705753 1.612358 1.562215 1.781860 2.006747    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.35414121 0.5905339 0.6968129 0.6878912 0.8031957 0.9196428    0
## GBM 0.40394079 0.6384142 0.7536110 0.7039434 0.8350270 0.8823697    0
## SVM 0.30116886 0.4680684 0.6331726 0.5972150 0.7119231 0.8688396    0
## KNN 0.03427829 0.1729105 0.2684638 0.3253176 0.4835366 0.7358602    0

postResample(predict(rf, test),  test$Yield)

##      RMSE  Rsquared       MAE 
## 1.0599536 0.7227313 0.8999873

postResample(predict(gbm, test), test$Yield)

##      RMSE  Rsquared       MAE 
## 1.1527547 0.6419587 0.9580847

postResample(predict(svm, test), test$Yield)

##      RMSE  Rsquared       MAE 
## 1.1472997 0.6776595 0.9196813

postResample(predict(knn, test), test$Yield)

##      RMSE  Rsquared       MAE 
## 1.5217635 0.3923186 1.1051429
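The `trainControl(method = "cv", number = 10)` setting used above requests 10-fold cross-validation. The base-R sketch below shows the idea on simulated data, with a plain linear model standing in for the four models; all names and data here are hypothetical, and the random fold assignment is a simplification of how caret builds its folds.

```r
set.seed(123)
n <- 100
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.3)

# Assign each row to one of 10 folds at random.
folds <- sample(rep(1:10, length.out = n))

fold_rmse <- sapply(1:10, function(f) {
  fit <- lm(y ~ x1 + x2, data = toy[folds != f, ])     # fit on 9 folds
  pred <- predict(fit, newdata = toy[folds == f, ])    # predict held-out fold
  sqrt(mean((toy$y[folds == f] - pred)^2))
})

summary(fold_rmse)  # spread of RMSE across the 10 folds
mean(fold_rmse)     # the single number usually reported
```

The `Min.` through `Max.` columns in the resampling summary above describe this same kind of fold-to-fold spread, which is why the mean alone does not show how stable a model is.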

Which predictors are most important? Do the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The variable importance results from the random forest model show that manufacturing process variables lead the list: ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess31, and ManufacturingProcess17 take the top four spots, and ManufacturingProcess32 alone scores more than twice as high as any other predictor (100 versus 40.9). The top ten are split evenly, with five process and five biological variables, but the biological ones (e.g., BiologicalMaterial03 and BiologicalMaterial06) sit lower, none with a scaled importance above 20. This indicates that process conditions play a more dominant role in determining yield than the biological inputs.

Compared to the optimal linear model, there is some overlap in the important predictors, but the random forest places greater emphasis on process variables. This difference suggests that the nonlinear model is better able to capture complex relationships and interactions among process variables that are not fully represented in the linear model.

# Variable importance from RF
imp <- varImp(rf)

# View top variables
imp

## rf variable importance
## 
##   only 20 most important variables shown (out of 57)
## 
##                        Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13  40.910
## ManufacturingProcess31  22.111
## ManufacturingProcess17  19.409
## BiologicalMaterial03    19.085
## BiologicalMaterial06    15.855
## ManufacturingProcess09  15.188
## BiologicalMaterial11    12.115
## BiologicalMaterial12     9.090
## BiologicalMaterial05     8.268
## BiologicalMaterial04     7.665
## ManufacturingProcess36   6.915
## ManufacturingProcess25   5.789
## ManufacturingProcess15   5.533
## ManufacturingProcess24   5.371
## ManufacturingProcess01   5.071
## ManufacturingProcess30   4.958
## ManufacturingProcess06   4.575
## BiologicalMaterial02     4.496
## BiologicalMaterial01     4.329

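# Top 10 predictors (drop = FALSE keeps the data frame, and with it the predictor names)
top10 <- head(imp$importance[order(imp$importance$Overall, decreasing = TRUE), , drop = FALSE], 10)
top10

##                           Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13  40.909840
## ManufacturingProcess31  22.111140
## ManufacturingProcess17  19.409362
## BiologicalMaterial03    19.084825
## BiologicalMaterial06    15.854998
## ManufacturingProcess09  15.187843
## BiologicalMaterial11    12.115352
## BiologicalMaterial12     9.089606
## BiologicalMaterial05     8.267685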

# Plot importance
plot(imp, top = 10)
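To answer the "biological or process" question directly, the top ten names can be tallied by prefix. The names below are copied from the importance listing above; the variable names `top10_names` and `predictor_type` are just for this sketch.

```r
# Top ten predictors, copied from the random forest importance output.
top10_names <- c(
  "ManufacturingProcess32", "ManufacturingProcess13",
  "ManufacturingProcess31", "ManufacturingProcess17",
  "BiologicalMaterial03", "BiologicalMaterial06",
  "ManufacturingProcess09", "BiologicalMaterial11",
  "BiologicalMaterial12", "BiologicalMaterial05"
)

# Classify each predictor by its name prefix.
predictor_type <- ifelse(grepl("^Manufacturing", top10_names),
                         "Process", "Biological")
table(predictor_type)  # 5 Biological, 5 Process
```

The count is even, so the case for process variables dominating rests on their rank and importance scores rather than on how many of them appear.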

Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Example variables (adjust if needed)
vars <- c("ManufacturingProcess32",
          "ManufacturingProcess13",
          "ManufacturingProcess31")

# Scatterplots
featurePlot(x = train[, vars],
            y = train$Yield,
            plot = "scatter")

The scatterplots of the top predictors unique to the nonlinear model show that the relationships between these manufacturing process variables and yield are complex and not strictly linear. For example, ManufacturingProcess32 displays a clear positive trend, where higher values are generally associated with higher yield, but the spread of the data suggests variability and possible nonlinear effects. ManufacturingProcess13 shows a weaker and more dispersed relationship, indicating that its impact on yield may depend on interactions with other variables rather than acting independently. ManufacturingProcess31 exhibits a clustered pattern, with most observations concentrated in a narrow range, suggesting a threshold or limited effect unless combined with other factors.

Overall, these plots suggest that process variables have more structured and complex relationships with yield, including potential nonlinearities and interactions. This provides intuition for why nonlinear models such as random forests can outperform linear models on this data: they are able to capture these more complicated patterns.


Rebecca Bronstein

2026-04-26
