Exercises

7.2 Friedman (1991) introduced several benchmark data sets created by simulation.

One of these simulations used the following nonlinear equation to create data: \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\), where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:

# Load necessary libraries
library(mlbench)

## Warning: package 'mlbench' was built under R version 4.5.2

library(caret)

## Warning: package 'caret' was built under R version 4.5.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Loading required package: lattice

# Set seed for reproducibility
set.seed(200)

# Generate training data (n=200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

# Generate a large test set (n=5000) for precise error estimation
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
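
For reference, the generating equation can also be written out in base R. This is only a sketch under stated assumptions: `simulate_friedman1` is a hypothetical helper, not part of mlbench, and it will not reproduce the exact random draws of mlbench.friedman1.

```r
# Hypothetical helper (not part of mlbench): the Friedman 1 equation in base R.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("X", 1:10)))
  # Only X1-X5 enter the signal; X6-X10 are pure noise predictors.
  signal <- 10 * sin(pi * x[, "X1"] * x[, "X2"]) +
    20 * (x[, "X3"] - 0.5)^2 + 10 * x[, "X4"] + 5 * x[, "X5"]
  list(x = as.data.frame(x), y = signal + rnorm(n, sd = sd), signal = signal)
}

set.seed(200)
sim <- simulate_friedman1(200, sd = 1)
str(sim$x)
```

Keeping the noise-free `signal` alongside `y` makes it easy to see that no model can be expected to reach a test RMSE much below the noise standard deviation.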



# Tune a K-Nearest Neighbors (KNN) model
knnModel <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "knn",
  preProc = c("center", "scale"),
  tuneLength = 10
)

# View the cross-validation results
print(knnModel)

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

# Predict on the test set
knnPred <- predict(knnModel, newdata = testData$x)

# Calculate performance metrics (RMSE and Rsquared)
postResample(pred = knnPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461
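
KNN measures distance over all ten predictors, so the five noise variables contribute as much to each distance as the informative ones. The sketch below illustrates this with a hand-written KNN on freshly simulated data; `knn_predict`, `friedman_signal`, and `rmse` are hypothetical helpers, and the data are not the training and test sets used above.

```r
# Hypothetical helper: plain k-nearest-neighbour regression with Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])       # average the k closest responses
  })
}

# Noise-free part of the Friedman 1 equation (columns 1-5 only).
friedman_signal <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5]
}

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

set.seed(1)
train_x <- matrix(runif(200 * 10), ncol = 10)
test_x  <- matrix(runif(500 * 10), ncol = 10)
train_y <- friedman_signal(train_x) + rnorm(200)
test_y  <- friedman_signal(test_x) + rnorm(500)

# Same k, once with all ten predictors and once with the informative five.
pred_all  <- knn_predict(train_x, train_y, test_x, k = 17)
pred_info <- knn_predict(train_x[, 1:5], train_y, test_x[, 1:5], k = 17)
c(all_ten = rmse(pred_all, test_y), informative_only = rmse(pred_info, test_y))
```

Comparing the two numbers shows how much of the KNN error is attributable to the noise dimensions rather than to the method itself.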

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

# Train MARS model
marsModel <- train(
  x = trainingData$x, 
  y = trainingData$y,
  method = "earth", # 'earth' is the package for MARS
  tuneGrid = expand.grid(degree = 1:2, nprune = 2:20)
)

## Loading required package: earth

## Warning: package 'earth' was built under R version 4.5.2

## Loading required package: Formula

## Loading required package: plotmo

## Warning: package 'plotmo' was built under R version 4.5.2

## Loading required package: plotrix

## Warning: package 'plotrix' was built under R version 4.5.2

# Check variable importance
varImp(marsModel)

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.24
## X2   48.74
## X5   15.53
## X3    0.00

# Compare with KNN
marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)

##      RMSE  Rsquared       MAE 
## 1.1722635 0.9448890 0.9324923
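
The three numbers reported by postResample can be reproduced by hand. A minimal sketch, where `manual_metrics` is a hypothetical helper and the two vectors are toy values, not model output:

```r
# Hypothetical helper mirroring the three numbers postResample reports.
manual_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,   # squared correlation, not 1 - SSE/SST
    MAE      = mean(abs(pred - obs)))
}

# Toy values for illustration only.
obs  <- c(10, 12, 15, 18, 20)
pred <- c(11, 11, 16, 17, 22)
m <- manual_metrics(pred, obs)
m
```

Because Rsquared here is a squared correlation, it can stay high even when predictions are systematically biased, so it is worth reading together with RMSE.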

1. Which model performs best?

MARS (Multivariate Adaptive Regression Splines) performs substantially better than KNN on the test set: RMSE 1.17 versus 3.20, and R-squared 0.945 versus 0.682.

- KNN performance: The test RMSE is 3.20. KNN struggles here because its distance calculation weights all ten predictors equally, so the noise variables (\(X_6-X_{10}\)) dilute the contribution of the informative ones (\(X_1-X_5\)).

- MARS performance: The test RMSE is 1.17 (R-squared 0.945), close to the simulated noise standard deviation of 1. MARS performs better because it builds piecewise-linear “hinge functions” that can follow the nonlinearities, and it leaves out predictors that do not reduce the error.
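
To make the hinge functions concrete, the sketch below fits a mirrored pair of hinges to a curve shaped like the \(X_3\) term. It is an illustration in base R, not the earth algorithm: the knot at 0.5 is fixed by assumption instead of being searched for, and `hinge` is a hypothetical helper.

```r
# Hinge function used as a MARS basis: h(x, knot) = max(0, x - knot).
hinge <- function(x, knot) pmax(0, x - knot)

set.seed(1)
x <- runif(200)
y <- 20 * (x - 0.5)^2 + rnorm(200, sd = 0.5)  # X3-style term; noise level is arbitrary

# Mirrored pair of hinges at an assumed knot of 0.5, fitted by least squares.
# hinge(-x, -0.5) equals max(0, 0.5 - x), the left-hand arm of the pair.
fit_hinge  <- lm(y ~ hinge(x, 0.5) + hinge(-x, -0.5))
fit_linear <- lm(y ~ x)

r2 <- c(hinge  = summary(fit_hinge)$r.squared,
        linear = summary(fit_linear)$r.squared)
r2
```

A single straight line cannot follow a symmetric curve, while two hinges meeting at the knot give a V-shaped approximation to it.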

2. Does MARS select the informative predictors?

Yes. MARS successfully performs automated feature selection.

- Selection: The varImp (variable importance) output ranks \(X_1, X_4, X_2,\) and \(X_5\) highest, with \(X_3\) listed last.

- Filtering: None of the non-informative variables (\(X_6-X_{10}\)) appears in the importance table, which indicates that no retained MARS term uses them.

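- Observation on \(X_3\): \(X_3\) shows an importance of 0.00, but this does not by itself mean it was dropped. By default varImp rescales importance to a 0-100 range, so the lowest-ranked predictor it lists is assigned 0. The test RMSE of 1.17 is also lower than would be expected if the \(20(X_3 - 0.5)^2\) term were ignored entirely, so \(X_3\) most likely enters the model with a comparatively small contribution. Running summary(marsModel$finalModel) lists the retained terms and would confirm this.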
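
One caveat when reading a 0.00 in a varImp table: with the default scale = TRUE, the scores are rescaled to run from 0 to 100, so the smallest listed score maps to 0 even if the raw value is positive. The sketch below shows that rescaling on made-up raw scores; the names and values are hypothetical and are not taken from the MARS model.

```r
# Made-up raw importance scores for illustration only (not from the model above).
raw <- c(A = 40, B = 25, C = 12, D = 3)

# Min-max rescaling to 0-100: the smallest listed score always maps to 0.
scaled <- (raw - min(raw)) / (max(raw) - min(raw)) * 100
round(scaled, 2)
```

Calling varImp(marsModel, scale = FALSE) shows the unscaled values and avoids this ambiguity.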

7.5 Exercise 6.3 describes data for a chemical manufacturing process.

Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

# Load libraries and data
library(AppliedPredictiveModeling)

## Warning: package 'AppliedPredictiveModeling' was built under R version 4.5.3

library(caret)
library(earth)
data(ChemicalManufacturingProcess)

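# 1. Pre-processing: impute missing values and remove near-zero variance predictors.
# Note: method = "knnImpute" also centers and scales every numeric column,
# including Yield, so all error metrics below are in standardized units.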
imputedData <- preProcess(ChemicalManufacturingProcess, method = "knnImpute")
df <- predict(imputedData, ChemicalManufacturingProcess)
df <- df[, -nearZeroVar(df)]

# 2. Data Splitting (80/20)
set.seed(123)
index <- createDataPartition(df$Yield, p = 0.8, list = FALSE)
trainData <- df[index, ]
testData <- df[-index, ]

# 3. Model Training: SVM (Radial) and MARS
ctrl <- trainControl(method = "cv", number = 10)

set.seed(123)
svmModel <- train(Yield ~ ., data = trainData, method = "svmRadial", 
                  preProc = c("center", "scale"), tuneLength = 10, trControl = ctrl)

set.seed(123)
marsModel <- train(Yield ~ ., data = trainData, method = "earth", 
                   tuneGrid = expand.grid(degree = 1:2, nprune = 2:20), trControl = ctrl)

# 4. Compare Performance
results <- resamples(list(SVM = svmModel, MARS = marsModel))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: SVM, MARS 
## Number of resamples: 10 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM  0.3473015 0.3907690 0.4406930 0.4877297 0.5874983 0.6805487    0
## MARS 0.3835627 0.4563774 0.5049626 0.5442126 0.6124638 0.7888363    0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM  0.4161236 0.4866352 0.5458892 0.5980572 0.7397282 0.8486148    0
## MARS 0.5205918 0.5539103 0.6223915 0.6734118 0.7790491 1.0033519    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM  0.5620966 0.6461062 0.7075947 0.6953319 0.7453093 0.8036663    0
## MARS 0.3873476 0.4949282 0.6278848 0.5969223 0.6894102 0.7925182    0

# 5. Predict on Test Set
svmPred <- predict(svmModel, newdata = testData)
postResample(pred = svmPred, obs = testData$Yield)

##      RMSE  Rsquared       MAE 
## 0.6663382 0.5503319 0.5621311
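
Because preProcess was applied to the whole data frame with method = "knnImpute", Yield itself was centered and scaled, so the RMSE values here are in standard-deviation units. A sketch of converting such an RMSE back to original units follows; every number in it is a simulated placeholder, not the real Yield column or real predictions.

```r
set.seed(1)
# Placeholder values, not the real Yield column.
yield_raw <- rnorm(100, mean = 40, sd = 1.8)
yield_std <- as.numeric(scale(yield_raw))     # what centering and scaling does

# Placeholder predictions on the standardized scale.
pred_std <- yield_std + rnorm(100, sd = 0.6)
rmse_std <- sqrt(mean((pred_std - yield_std)^2))

# Undo the scaling: multiply by the SD and add back the mean.
pred_raw <- pred_std * sd(yield_raw) + mean(yield_raw)
rmse_raw <- sqrt(mean((pred_raw - yield_raw)^2))

c(standardized = rmse_std, original_units = rmse_raw,
  ratio = rmse_raw / rmse_std)
```

The ratio equals the standard deviation of the raw response, so multiplying a standardized RMSE by sd(Yield) from the unprocessed data gives the error in yield units.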

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

SVM (Support Vector Machine) has the better resampling performance of the two models trained here.

RMSE: The mean cross-validated RMSE for SVM is 0.598, lower than for MARS (0.673). A lower RMSE indicates that the SVM’s predictions are closer to the actual yield values. Because the imputation step also standardized Yield, these values are in standard-deviation units.

R-squared: SVM explains more variance in the data with a mean R-squared of 0.695, compared to 0.597 for MARS.

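Conclusion: SVM is the preferred model on resampling. On the test set it reaches an RMSE of 0.666 and an R-squared of 0.550, somewhat worse than its cross-validated estimates. Test-set predictions were not computed for MARS, so the test-set comparison is incomplete.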
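
Since both models were resampled on the same folds, the difference in mean RMSE can be checked with a paired comparison. The sketch below shows the idea in base R on simulated per-fold values; the numbers are placeholders, not the folds from the models above. With the caret objects at hand, summary(diff(results)) is the more direct route.

```r
set.seed(1)
# Placeholder per-fold RMSE values (simulated; NOT the folds from the models above).
fold_effect <- rnorm(10, sd = 0.12)            # difficulty shared by both models
rmse_a <- 0.60 + fold_effect + rnorm(10, sd = 0.05)
rmse_b <- 0.67 + fold_effect + rnorm(10, sd = 0.05)

# Paired test: both models were scored on the same folds,
# so the per-fold differences are what gets compared.
paired <- t.test(rmse_a, rmse_b, paired = TRUE)
paired$estimate   # mean per-fold difference
paired$conf.int   # 95% interval for that difference
```

If the interval for the difference includes zero, the gap between the two means is within what fold-to-fold variation alone could produce.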

(b) Which predictors are most important in the optimal nonlinear regression model?

Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

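Dominance: Running varImp(svmModel) shows manufacturing process variables dominating the top of the list, which suggests that the settings of the production line are more strongly associated with yield than the biological raw materials. Note that caret has no SVM-specific importance measure, so for svmRadial these scores come from a model-free filter (a loess fit of the response on each predictor, scored by R-squared) rather than from the fitted SVM.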

Linear vs. non-linear: The top predictors largely overlap with those identified by the linear models in the previous chapter. Predictors that rank higher here than in the linear model are candidates for non-linear effects, such as “sweet spots” where a process setting is most efficient, which a linear model cannot represent.
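
For models without a built-in importance measure, caret's varImp falls back to scoring each predictor on its own. The sketch below imitates that idea in base R with a loess smoother and a pseudo R-squared. It is not caret's code: `loess_r2` is a hypothetical helper and the data are simulated.

```r
# Hypothetical helper: pseudo-R^2 of a loess smoother of y on a single predictor.
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
}

set.seed(1)
n <- 150
# Simulated predictors: one curved effect, one linear effect, one pure noise.
dat <- data.frame(curved = runif(n), linear = runif(n), noise = runif(n))
y <- 4 * (dat$curved - 0.5)^2 + 0.5 * dat$linear + rnorm(n, sd = 0.1)

# One score per predictor, computed independently of any fitted model.
r2 <- sapply(dat, loess_r2, y = y)
sort(r2, decreasing = TRUE)
```

A smoother-based score can rank a curved predictor highly even when its linear correlation with the response is near zero, which is one reason the rankings can differ from those of a linear model.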

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

Exploring the top predictors unique to the SVM model reveals non-linear relationships.

Intuition: Visualizing these predictors often shows a threshold effect. For example, a specific manufacturing temperature might improve yield significantly up to a certain point, after which the benefit plateaus or declines.

Insight: This suggests that the chemical process is sensitive to specific operating windows. A non-linear model such as the SVM can represent these plateaus and turning points, whereas a linear model assumes the relationship continues indefinitely in a straight line.
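
A plot of this kind can be sketched as a scatterplot with a smoother and a straight-line fit overlaid. The data below are simulated with a plateau built in, and the variable name `process_setting` is hypothetical; with the real data, the same calls would be applied to a top-ranked predictor and Yield.

```r
set.seed(1)
# Placeholder data with a plateau; the variable name is hypothetical.
process_setting <- runif(150, 0, 10)
yield <- pmin(process_setting, 6) + rnorm(150, sd = 0.5)

pdf(NULL)  # discard graphics output; drop this line to see the plot interactively
plot(process_setting, yield, pch = 16, col = "grey40",
     xlab = "Process setting (hypothetical)", ylab = "Yield")
smooth <- lowess(process_setting, yield)       # locally weighted smoother
lines(smooth, col = "red", lwd = 2)
abline(lm(yield ~ process_setting), lty = 2)   # straight-line fit for contrast
invisible(dev.off())
```

Where the smoother bends away from the dashed line, the straight-line assumption is the part of the linear model that fails.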

Data 624: Homework 8

Arutam Antunish

2026-04-25