7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data…

Exercise 7.2 is a classic way to see how non-linear models handle noise. The Friedman simulation includes 10 predictors, but only X_1 through X_5 actually influence the outcome y. The rest are just distractions for the model.

Creating a synthetic dataset where I use a mathematical formula to create the output Y based on specific inputs X. Generating random numbers, set.seed(200) Create 200 samples for training to teach the models and 5,000 samples for testing to see if the models actually learned.

The equation only uses X1 through X5. The simulation will give us 10 variables X1 through X10. We make sure if the models in Chapter 7 are smart enough to ignore the 5 noise variables.

Prepare Data

Recreating the training and test sets

# Load the tools
library(mlbench)
library(caret)

# Set seed for reproducibility
set.seed(200)

# Generating training data (200 samples)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

# Generating a large test set (5000 samples) to get a precise error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

# 4. Quick look at the data
head(trainingData$x)
##          X1        X2         X3         X4         X5         X6        X7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
##          X8         X9       X10
## 1 0.4214081 0.59111440 0.5886216
## 2 0.8446239 0.92819306 0.7584008
## 3 0.3495908 0.01759542 0.4441185
## 4 0.7970260 0.68986918 0.4450716
## 5 0.9038919 0.39696995 0.5500808
## 6 0.6489177 0.53116033 0.9066182

Using the featurePlot function from the caret package creating a Scatterplot Matrix where every single predictor X1 through X10 is plotted against the outcome Y.

# This creates a grid of scatter plots
# 'x' is our predictors, 'y' is our target
# type = c("p", "smooth") adds a trend line to help see patterns
featurePlot(x = trainingData$x, 
            y = trainingData$y, 
            plot = "scatter",
            type = c("p", "smooth"),
            span = .5,
            layout = c(5, 2))

The featurePlot confirms that X1 through X2 are the only informative variables because they show clear sloped or curved patterns, while X6 through X10 appear as flat, random clouds, proving they are just noise added to distract the model.

Training the K-Nearest Neighbors (KNN) Model

Now training the model K-Nearest Neighbors (KNN), which looks at the data points in the training set that are geographically closest to the new point you want to predict and averages their values.

# Defining how how we want to test the model (5-fold Cross-Validation for speed)
ctrl <- trainControl(method = "cv", number = 5)

# Train the model
knnModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = ctrl)

# View the results
print(knnModel)
## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.310218  0.5640866  2.695817
##    7  3.183598  0.6130231  2.575609
##    9  3.133750  0.6548230  2.519030
##   11  3.135078  0.6653443  2.528709
##   13  3.106943  0.6907083  2.503302
##   15  3.081233  0.7116627  2.465985
##   17  3.100552  0.7236138  2.499389
##   19  3.169291  0.7099245  2.562107
##   21  3.192033  0.7134604  2.586728
##   23  3.216710  0.7143357  2.601841
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
# Check the performance on the Test Set
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 3.1750657 0.6785946 2.5443169

The shows that k=15 is the optimal value for your model, producing a balanced test RMSE of 3.17 and confirming that the model captures about 68% of the relationship in the data without being too distracted by the noise.

KNN Tuning Plot

This plot shows how the error changes as we add more neighbors. You noticed K=15 was the best since the error is lowest.

library(ggplot2)
ggplot(knnModel) + 
  theme_bw() +
  labs(title="KNN Model: Finding the Optimal K",
       x="Number of Neighbors (K)",
       y="RMSE (Cross-Validation)")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the caret package.
##   Please report the issue at <https://github.com/topepo/caret/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Multivariate Adaptive Regression Splines (MARS)

Now training a MARS model using the earth package using technique that doesn’t try to fit one big curved line to the data. Instead, it breaks the data into segments and fits a piece of a straight line to each segment using hinge functions aka called hockey-stick functions.

By setting degree = 2, we tell MARS to look for cases where two variables interact with each other.

# Define the tuning grid
# 'degree' 1 is additive, 2 includes interactions
# 'nprune' is the maximum number of terms in the final model
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)

# Train the MARS model
marsModel <- train(x = trainingData$x, 
                   y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   trControl = ctrl) # Using the same 'ctrl' from Step 2
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
# View the best tuning parameters
print(marsModel)
## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.245901  0.2787321  3.483428
##   1        3      3.583229  0.4805401  2.881311
##   1        4      2.657913  0.7084960  2.125487
##   1        5      2.335060  0.7744494  1.893765
##   1        6      2.322011  0.7813465  1.833316
##   1        7      1.855381  0.8615696  1.450558
##   1        8      1.764554  0.8750919  1.382519
##   1        9      1.781814  0.8724492  1.364755
##   1       10      1.759272  0.8752761  1.376871
##   1       11      1.729588  0.8800043  1.351833
##   1       12      1.655616  0.8902637  1.281361
##   1       13      1.681933  0.8890678  1.315516
##   1       14      1.681012  0.8898265  1.320421
##   1       15      1.681012  0.8898265  1.320421
##   1       16      1.681012  0.8898265  1.320421
##   1       17      1.681012  0.8898265  1.320421
##   1       18      1.681012  0.8898265  1.320421
##   1       19      1.681012  0.8898265  1.320421
##   1       20      1.681012  0.8898265  1.320421
##   2        2      4.245901  0.2787321  3.483428
##   2        3      3.564742  0.4849456  2.842040
##   2        4      2.713767  0.7010239  2.143635
##   2        5      2.305559  0.7822752  1.864245
##   2        6      2.176343  0.8045383  1.717513
##   2        7      1.865123  0.8608544  1.430603
##   2        8      1.770984  0.8719387  1.403682
##   2        9      1.550917  0.9059240  1.229484
##   2       10      1.436948  0.9186192  1.080197
##   2       11      1.411424  0.9220546  1.086060
##   2       12      1.407339  0.9238775  1.093442
##   2       13      1.369094  0.9295236  1.074004
##   2       14      1.392390  0.9286478  1.100633
##   2       15      1.381626  0.9310271  1.093632
##   2       16      1.397686  0.9289366  1.110832
##   2       17      1.397686  0.9289366  1.110832
##   2       18      1.400270  0.9286182  1.107542
##   2       19      1.400270  0.9286182  1.107542
##   2       20      1.400270  0.9286182  1.107542
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
# Check test set performance
marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

MARS significantly outperformed KNN by reducing the RMSE from 3.18 to 1.28, as its ability to model second-degree interactions (degree = 2) much more accurately captured the underlying non-linear signal of the Friedman equation.

MARS Tuning Plot

This plot shows two lines since MARS has two knobs (degree and nprune).It proves that allowing for interactions (degree = 2) dropped the error much faster than a simple model.

ggplot(marsModel) + 
  theme_bw() +
  labs(title="MARS Model: Pruning and Interaction Results",
       x="Number of Retained Terms (nprune)",
       y="RMSE (Cross-Validation)")

Support Vector Machines (SVM)

Training a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel trying to find a path that fits the data points. If a point falls inside this path, the model ignores the error; it only focuses on the difficult points that fall outside the Support Vectors.

# Train the SVM model
# 'svmRadial' is the most common non-linear SVM method
svmModel <- train(x = trainingData$x, 
                  y = trainingData$y,
                  method = "svmRadial",
                  preProc = c("center", "scale"), # Vital for SVM
                  tuneLength = 10, # Tells R to try 10 combos of Cost and Sigma
                  trControl = ctrl) # Uses the 5-fold CV we created earlier

# View the best tuning parameters
print(svmModel)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 160, 160, 160, 160, 160 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.582302  0.7882522  2.049558
##     0.50  2.316851  0.8002566  1.830721
##     1.00  2.145594  0.8200534  1.668402
##     2.00  2.076104  0.8277830  1.620225
##     4.00  2.001859  0.8383882  1.569451
##     8.00  1.982921  0.8421005  1.544838
##    16.00  1.988037  0.8412043  1.544945
##    32.00  1.988037  0.8412043  1.544945
##    64.00  1.988037  0.8412043  1.544945
##   128.00  1.988037  0.8412043  1.544945
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06406091
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06406091 and C = 8.
# Predict on the large test set
svmPred <- predict(svmModel, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
##     RMSE Rsquared      MAE 
## 2.058684 0.828299 1.562474

While the SVM with its Kernel Trick performed significantly better than the baseline KNN, MARS remains the clear winner with a much lower RMSE, 1.28 vs 2.06, proving that segmenting the data into specific hinge interactions is the most effective way to solve the complex Friedman equation.

SVM Tuning Plot

This plot visualizes the Cost (C) parameter where the error drop as the Cost increases, then level off once the model has found the best path to fit the data.

ggplot(svmModel) + 
  scale_x_log10() + # SVM costs are often best viewed on a log scale
  theme_bw() +
  labs(title="SVM Model: Impact of Cost (C) on Performance",
       x="Cost (Log Scale)",
       y="RMSE (Cross-Validation)")

Neural Networks (Universal Approximator)

Now training a Neural Network which is a series of layers where each layer tries to find a different part of the pattern. It takes your X variables, passes them through hidden units, and then combines them into a final prediction for Y.

library(caret)
library(nnet)

# Faster tuning to avoid long running
nnGrid <- expand.grid(size = c(1, 3, 5), 
                      decay = c(0, 0.01, 0.1))

nnModel <- train(x = trainingData$x, 
                 y = trainingData$y,
                 method = "nnet",
                 preProc = c("center", "scale"),
                 tuneGrid = nnGrid,
                 linout = TRUE, 
                 trace = FALSE, 
                 maxit = 100, 
                 trControl = ctrl) # Uses the 'ctrl' made earlier in code

# Check the results
nnPred <- predict(nnModel, newdata = testData$x)
postResample(pred = nnPred, obs = testData$y)
##      RMSE  Rsquared       MAE 
## 2.6432570 0.7194544 2.0232018

Neural Network Tuning Plot

This plot is a grid. It shows how different combinations of Hidden Units (Size) and Weight Decay work together to find the signal in Friedman data.

ggplot(nnModel) + 
  theme_bw() +
  labs(title="Neural Network: Hidden Units & Weight Decay Optimization",
       x="Number of Hidden Units",
       y="RMSE (Cross-Validation)")

Goal: We are looking for the lowest point on the line. That point represents the Optimal model settings.

Interpretation: MARS is the best moddel because its significantly lower RMSE of 1.28 and high R2 of 0.93 prove it successfully captured the underlying non-linear equation and interaction terms, whereas the SVM (2.06), Neural Network (2.92), and KNN (3.17) models were notably less precise.

7.5. Exercise 6.3 describes data for a chemical manufacturing process. Use

the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

Data Cleaning

Loading the AppliedPredictiveModeling package, which contains the chemical dataset.

Since the raw data has missing values (NA) and some columns where the numbers never change (Near-Zero Variance), we are using preProcess to handle it. knnImpute fills in the gaps, and nzv removes the dead weight predictors that don’t help the model learn.

library(AppliedPredictiveModeling)
library(caret)

data(ChemicalManufacturingProcess)


# We exclude column 1 (Yield) because we only clean the predictors
cleaning_plan <- preProcess(ChemicalManufacturingProcess[, -1], 
                            method = c("knnImpute", "nzv"))

# Applying the cleaning to the predictors
chemPredictors <- predict(cleaning_plan, ChemicalManufacturingProcess[, -1])
chemYield <- ChemicalManufacturingProcess$Yield

Data Splitting and Control

Using createDataPartition to ensure the distribution of our target variable, Yield, is similar in both sets. We also define a trainControl object, telling R to use 10-fold Cross-Validation for every model we train later, ensuring our performance estimates are reliable.

# Set seed for reproducibility
set.seed(100)

# Creating an index for an 80/20 split
trainIndex <- createDataPartition(chemYield, p = 0.8, list = FALSE)

# Creating the actual training and testing sets
trainX <- chemPredictors[trainIndex, ]
trainY <- chemYield[trainIndex]
testX  <- chemPredictors[-trainIndex, ]
testY  <- chemYield[-trainIndex]

# Define the 'rules' for training (10-fold Cross-Validation)
ctrl <- trainControl(method = "cv", number = 10)

Visualizing the Target Variable (Yield)

library(ggplot2)

# Creating a histogram of the target variable
ggplot(data.frame(Yield = chemYield), aes(x = Yield)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 20) +
  theme_minimal() +
  labs(title = "Distribution of Chemical Manufacturing Yield",
       x = "Yield (%)", y = "Frequency")

Visualizing Missing Data (The “Why” for Imputation)

# Checking for missing values in the raw data (before we imputed)
# This will show you a bar chart of which variables had the most NAs
data(ChemicalManufacturingProcess)
missing_counts <- colSums(is.na(ChemicalManufacturingProcess))
missing_data <- data.frame(Variable = names(missing_counts), NAs = missing_counts)
missing_data <- missing_data[missing_data$NAs > 0, ]

ggplot(missing_data, aes(x = reorder(Variable, NAs), y = NAs)) +
  geom_bar(stat = "identity", fill = "salmon") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Variables with Missing Values", x = "Predictor", y = "Number of NAs")

Visualizing Predictor Relationships

# Plotting the first 4 predictors against Yield
featurePlot(x = trainX[, 1:4], 
            y = trainY, 
            plot = "scatter",
            type = c("p", "smooth"),
            layout = c(2, 2))

K-Nearest Neighbors (KNN)

Training our first non-linear model, K-Nearest Neighbors (KNN), on the cleaned chemical data and asking the model to predict the Yield of a batch by looking at the 10-fold cross-validation results to find the best value for K.

# Train the KNN model
# We already cleaned the data in Step 1, so we just run the training
set.seed(100)
knnModel_chem <- train(x = trainX, 
                       y = trainY,
                       method = "knn",
                       preProc = c("center", "scale"), # Standard for KNN
                       tuneLength = 10,
                       trControl = ctrl)

# View the tuning results
print(knnModel_chem)
## k-Nearest Neighbors 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  1.371596  0.4680392  1.092167
##    7  1.408345  0.4483731  1.133631
##    9  1.419246  0.4494570  1.154292
##   11  1.418249  0.4469231  1.164150
##   13  1.431516  0.4368412  1.181147
##   15  1.440135  0.4442015  1.185132
##   17  1.450507  0.4469780  1.190494
##   19  1.461363  0.4466034  1.197818
##   21  1.473885  0.4403578  1.194148
##   23  1.475068  0.4489935  1.193789
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Check performance on the test set
knnPred_chem <- predict(knnModel_chem, newdata = testX)
postResample(pred = knnPred_chem, obs = testY)
##      RMSE  Rsquared       MAE 
## 1.2255603 0.4635978 0.9771250
library(ggplot2)

# Create the tuning profile plot
ggplot(knnModel_chem) + 
  theme_bw() +
  labs(title = "KNN Tuning Profile: Chemical Manufacturing Yield",
       x = "Number of Neighbors (K)",
       y = "RMSE (Cross-Validation)") +
  geom_point(color = "darkblue", size = 2) +
  geom_line(color = "steelblue")

The KNN tuning profile for the chemical manufacturing dataset identifies k=5 as the optimal neighbor count, producing a test RMSE of 1.23 and indicating that the model’s accuracy decreases as more neighbors are included in the prediction.

The upward trend in chart, means that as we increase the number of neighbors making the model smoother or more generalized, the predictions actually get worse.

Multivariate Adaptive Regression Splines (MARS)

MARS doesn’t just look at neighbors; it builds hinges (segmented linear relationships) to fit the data.

By testing degree = 1 (additive) and degree = 2 (interactions), we can see if certain manufacturing steps only matter when combined with specific raw materials.

# Define the tuning grid
# We test different levels of pruning (nprune) and interactions (degree)
marsGrid_chem <- expand.grid(degree = 1:2, nprune = 2:38)

# Train the MARS model
set.seed(100)
marsModel_chem <- train(x = trainX, 
                        y = trainY,
                        method = "earth",
                        tuneGrid = marsGrid_chem,
                        trControl = ctrl)

# View the best tuning results
print(marsModel_chem)
## Multivariate Adaptive Regression Spline 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE      
##   1        2      1.410080  0.4643855  1.1074276
##   1        3      1.281137  0.5293133  1.0345230
##   1        4      1.209121  0.5878085  0.9729039
##   1        5      1.225140  0.5762346  0.9793405
##   1        6      1.282601  0.5290458  1.0173925
##   1        7      1.309220  0.5014459  1.0308723
##   1        8      1.329401  0.4980445  1.0842932
##   1        9      1.334423  0.5080099  1.0847531
##   1       10      1.354758  0.4999791  1.1049413
##   1       11      1.363315  0.4984635  1.1140661
##   1       12      1.305802  0.5303189  1.0723186
##   1       13      1.326919  0.5178015  1.0835536
##   1       14      1.331249  0.5329869  1.0872012
##   1       15      1.333421  0.5249447  1.0733957
##   1       16      1.335131  0.5284241  1.0784748
##   1       17      1.335131  0.5284241  1.0784748
##   1       18      1.335131  0.5284241  1.0784748
##   1       19      1.335131  0.5284241  1.0784748
##   1       20      1.335131  0.5284241  1.0784748
##   1       21      1.335131  0.5284241  1.0784748
##   1       22      1.335131  0.5284241  1.0784748
##   1       23      1.335131  0.5284241  1.0784748
##   1       24      1.335131  0.5284241  1.0784748
##   1       25      1.335131  0.5284241  1.0784748
##   1       26      1.335131  0.5284241  1.0784748
##   1       27      1.335131  0.5284241  1.0784748
##   1       28      1.335131  0.5284241  1.0784748
##   1       29      1.335131  0.5284241  1.0784748
##   1       30      1.335131  0.5284241  1.0784748
##   1       31      1.335131  0.5284241  1.0784748
##   1       32      1.335131  0.5284241  1.0784748
##   1       33      1.335131  0.5284241  1.0784748
##   1       34      1.335131  0.5284241  1.0784748
##   1       35      1.335131  0.5284241  1.0784748
##   1       36      1.335131  0.5284241  1.0784748
##   1       37      1.335131  0.5284241  1.0784748
##   1       38      1.335131  0.5284241  1.0784748
##   2        2      1.410080  0.4643855  1.1074276
##   2        3      1.311424  0.5279942  1.0575544
##   2        4      1.225004  0.5866982  0.9912320
##   2        5      1.251020  0.5667185  1.0134387
##   2        6      1.265207  0.5698098  1.0291324
##   2        7      1.311435  0.5499015  1.0810013
##   2        8      1.324169  0.5317795  1.0819529
##   2        9      1.316123  0.5405667  1.0768382
##   2       10      1.346390  0.5296667  1.0936183
##   2       11      1.332605  0.5351786  1.0982778
##   2       12      1.338161  0.5323383  1.0919929
##   2       13      1.357093  0.5358739  1.0901525
##   2       14      1.335572  0.5478569  1.0644462
##   2       15      1.339080  0.5402747  1.0650611
##   2       16      1.370622  0.5232940  1.1067013
##   2       17      1.392308  0.5195618  1.1357506
##   2       18      1.404623  0.5140885  1.1503101
##   2       19      1.431108  0.5118381  1.1823986
##   2       20      1.447531  0.5036852  1.1928017
##   2       21      1.465251  0.5043899  1.1984829
##   2       22      1.492220  0.4981039  1.2287422
##   2       23      1.493332  0.4965699  1.2319614
##   2       24      1.499200  0.4931392  1.2360606
##   2       25      1.499200  0.4931392  1.2360606
##   2       26      1.499200  0.4931392  1.2360606
##   2       27      1.499200  0.4931392  1.2360606
##   2       28      1.499200  0.4931392  1.2360606
##   2       29      1.499200  0.4931392  1.2360606
##   2       30      1.499200  0.4931392  1.2360606
##   2       31      1.499200  0.4931392  1.2360606
##   2       32      1.499200  0.4931392  1.2360606
##   2       33      1.499200  0.4931392  1.2360606
##   2       34      1.499200  0.4931392  1.2360606
##   2       35      1.499200  0.4931392  1.2360606
##   2       36      1.499200  0.4931392  1.2360606
##   2       37      1.499200  0.4931392  1.2360606
##   2       38      1.499200  0.4931392  1.2360606
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
# Check performance on the test set
marsPred_chem <- predict(marsModel_chem, newdata = testX)
postResample(pred = marsPred_chem, obs = testY)
##      RMSE  Rsquared       MAE 
## 1.1657015 0.5172382 0.9320261

MARS achieved a superior RMSE of 1.16 by simplifying the 56 predictors into just four additive hinge terms, demonstrating that a sparse, interpretable model captures the chemical yield more effectively than the denser KNN approach.

# Create the Visualization
# This plot shows the Cross-Validated RMSE for different numbers of terms
ggplot(marsModel_chem) + 
  theme_bw() +
  labs(title = "MARS Tuning Profile: Chemical Manufacturing Yield",
       subtitle = "Optimal model selected at nprune = 4 and degree = 1",
       x = "Number of Retained Terms (nprune)",
       y = "RMSE (Cross-Validation)") +
  scale_color_brewer(palette = "Set1")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

The chart proves that a sparse model (few variables) with no interactions is the most robust way to handle this specific chemical process. The lowest point on the entire graph is the red dot at nprune = 4. This represents the most accurate version of the model.

Support Vector Machines (SVM)

Training a Support Vector Machine (SVM) with a Radial Basis Function (RBF) Kernel. While KNN looks at neighbors and MARS looks at hinges, SVM tries to find a path that surrounds the data points in a way that minimizes error while staying as flat as possible.

# Train the SVM model
# We use tuneLength = 10 to let caret automatically pick 10 values for C
set.seed(100)
svmModel_chem <- train(x = trainX, 
                        y = trainY,
                        method = "svmRadial",
                        preProc = c("center", "scale"),
                        tuneLength = 10,
                        trControl = ctrl)

# View the tuning results
print(svmModel_chem)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE      
##     0.25  1.465282  0.4608086  1.1861159
##     0.50  1.344515  0.5138932  1.0938197
##     1.00  1.234392  0.5756859  1.0030231
##     2.00  1.190829  0.5847420  0.9497286
##     4.00  1.164055  0.5944899  0.9168254
##     8.00  1.144467  0.6115905  0.9007322
##    16.00  1.138240  0.6167046  0.8951823
##    32.00  1.138240  0.6167046  0.8951823
##    64.00  1.138240  0.6167046  0.8951823
##   128.00  1.138240  0.6167046  0.8951823
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01229856
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01229856 and C = 16.
#  Check performance on the test set
svmPred_chem <- predict(svmModel_chem, newdata = testX)
postResample(pred = svmPred_chem, obs = testY)
##      RMSE  Rsquared       MAE 
## 1.0743186 0.5863003 0.8466954

The SVM model effectively leverages its radial kernel to achieve the lowest test error so far (RMSE = 1.07), outperforming both the MARS and KNN models by more accurately capturing the complex, high-dimensional relationships within the chemical manufacturing data.

library(caret)
library(ggplot2)

# Create the tuning profile plot
# scale_x_log10() is used because SVM costs usually vary exponentially (0.25, 1, 4, 16...)
ggplot(svmModel_chem) + 
  scale_x_log10() +
  theme_bw() +
  labs(title = "SVM Tuning Profile: Chemical Manufacturing Yield",
       subtitle = "Optimal Cost (C) = 16 with Sigma ≈ 0.012",
       x = "Cost (Log Scale)",
       y = "RMSE (Cross-Validation)") +
  geom_line(color = "darkblue") +
  geom_point(color = "darkblue", size = 2)

Neural Networks

Training an Averaged Neural Network (avNNet) that trains one brain, an averaged network trains several (usually 5) starting from different random points and averages their predictions. This makes the model much more stable

# Define a targeted tuning grid
nnGrid <- expand.grid(size = c(1, 3, 5), 
                      decay = c(0.01, 0.1),
                      bag = FALSE) # 'bag' is for bagging; we'll keep it simple

# Train the Averaged Neural Network
set.seed(100)
nnetModel_chem <- train(x = trainX, 
                        y = trainY,
                        method = "avNNet",
                        preProc = c("center", "scale"),
                        tuneGrid = nnGrid,
                        linout = TRUE, # Needed for regression
                        trace = FALSE, # Stops R from printing every iteration
                        maxit = 500,   # Maximum iterations
                        trControl = ctrl)

# View results
print(nnetModel_chem)
## Model Averaged Neural Network 
## 
## 144 samples
##  56 predictor
## 
## Pre-processing: centered (56), scaled (56) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##   1     0.01   1.448551  0.4969369  1.163080
##   1     0.10   1.398942  0.4930722  1.107411
##   3     0.01   1.727227  0.4427842  1.327945
##   3     0.10   1.708530  0.3985569  1.313031
##   5     0.01   1.765199  0.3668670  1.369516
##   5     0.10   1.708666  0.4279471  1.325957
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.
# Check test set performance
nnetPred_chem <- predict(nnetModel_chem, newdata = testX)
postResample(pred = nnetPred_chem, obs = testY)
##      RMSE  Rsquared       MAE 
## 3.0387059 0.1038209 1.5871214

The SVM model is the optimal choice for the chemical manufacturing dataset because its RMSE of 1.074 signifies superior predictive accuracy, effectively outperforming MARS (1.165), KNN (1.225), and the significantly underperforming Neural Network (3.038).

(b) Which predictors are most important in the optimal nonlinear regres

sion model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

To identify which variables drive the chemical manufacturing process, we calculate the Variable Importance for your optimal model (SVM).

Calculating and Visualizing Variable Importance

Since SVM model used a Radial Basis Function (RBF) kernel, importance is determined by how much the model’s accuracy (RMSE) would drop if a specific predictor were removed or shuffled.

We use the varImp() function from the caret package, which standardizes these scores on a scale of 0 to 100.

# Calculating variable importance for the SVM model
svmImp <- varImp(svmModel_chem)

# Plotting the top 10 most important predictors
plot(svmImp, top = 10, main = "Top 10 Predictors: SVM Model")

# View the raw scores for the top predictors
print(svmImp)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   97.84
## BiologicalMaterial06     82.22
## ManufacturingProcess17   77.27
## BiologicalMaterial03     76.21
## ManufacturingProcess36   70.77
## BiologicalMaterial02     68.79
## ManufacturingProcess09   67.86
## BiologicalMaterial12     63.36
## ManufacturingProcess06   55.15
## BiologicalMaterial04     54.31
## ManufacturingProcess33   49.26
## ManufacturingProcess31   47.73
## ManufacturingProcess11   45.72
## BiologicalMaterial11     42.44
## BiologicalMaterial08     41.89
## ManufacturingProcess29   41.28
## BiologicalMaterial01     41.19
## BiologicalMaterial09     39.70
## ManufacturingProcess02   36.69

SVM model has successfully filtered through the 56 variables to find the specific drivers of manufacturing yield.

Dominance Analysis

Manufacturing Process variables clearly dominate the list

Show 6 Process variables versus 4 Biological materials.

ManufacturingProcess32 is significantly more important than anything else, acting as the primary lever for controlling yield in this non-linear model.

Comparison with the Linear Model (PLS - Partial Least Squares)

we need to see how these results differ from the Optimal Linear Model (for this dataset is usually PLS - Partial Least Squares). Linear models and non-linear models often disagree because non-linear models can see patterns (like curves or spikes) that linear models miss.

# Train the optimal Linear Model (PLS) for comparison
set.seed(100)
plsModel_chem <- train(x = trainX, 
                       y = trainY,
                       method = "pls",
                       tuneLength = 20,
                       trControl = ctrl)

# Calculate and plot importance for the Linear Model
plsImp <- varImp(plsModel_chem)
## Warning: package 'pls' was built under R version 4.3.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
plot(plsImp, top = 10, main = "Top 10 Predictors: Linear (PLS) Model")

# Print the top 10 scores to compare side-by-side
print(plsImp)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   77.89
## ManufacturingProcess09   76.11
## ManufacturingProcess36   74.57
## ManufacturingProcess17   74.09
## BiologicalMaterial02     60.74
## BiologicalMaterial06     58.95
## ManufacturingProcess33   58.83
## BiologicalMaterial03     55.13
## ManufacturingProcess06   54.88
## ManufacturingProcess12   53.51
## BiologicalMaterial08     53.46
## BiologicalMaterial12     52.65
## BiologicalMaterial11     52.03
## ManufacturingProcess11   49.83
## BiologicalMaterial04     48.30
## BiologicalMaterial01     47.45
## ManufacturingProcess28   45.61
## ManufacturingProcess04   42.22
## ManufacturingProcess15   33.15

The Model Performance (Part A)

Support Vector Machines (SVM) is the optimal model with the lowest error (RMSE 1.07).

MARS performed well (RMSE 1.16) by simplifying the 56 variables into just 4 key hinges.

KNN was decent (RMSE 1.22) but struggled to filter out the noise in the large predictor set.

Neural Networks failed significantly (RMSE 3.03), likely because the dataset (144 samples) is too small for a network to learn effectively without overfitting.

Variable Importance (Part B)

Manufacturing Process variables dominate the top of the list, accounting for 6 of the top 10 predictors.

ManufacturingProcess32 is the main driver of yield in every model we tested.

While process variables lead, BiologicalMaterial06 is a major factor that the non-linear model values much more highly than the linear model.

Linear vs. Non-Linear: The main difference is that the SVM identified complex, curved patterns in biological materials that the Linear (PLS) model simply couldn’t see.

(c) Explore the relationships between the top predictors and the response for

the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

To answer we need to look for the predictors that the SVM (Non-Linear) model found important but the Linear (PLS) model completely overlooked.

The Linear Model (PLS): Only sees things that go up or down in a straight line. If a relationship is curved, this model gets confused and thinks that variable isn’t important.

The Non-Linear Model (SVM): Can see curves, waves, and sudden jumps. It understands complex patterns.

The “Difference” Test

To find out which predictors have those complex curves, we look at the Top 10 lists from both models and see where they disagree.

Looking at BiologicalMaterial12:

In SVM list, it is in the Top 10.

In PLS list, it is not found.

The Logic: If the straight-line model ignored it, but the curve-seeing model keep it, BiologicalMaterial12 must have a curved relationship with yield.

Look at BiologicalMaterial06:

The Linear model (PLS) thought it was okay and ranked #7.

The Non-Linear model (SVM) thought it was critical and ranked #3.

The SVM found extra information in this variable that the linear model missed. It didn’t just see a slight trend; it saw a complex pattern that makes it much more valuable for predicting yield.

Creating the Scatter Plots

We will use ggplot2 to plot Yield against these two specific predictors. The importance is visible by the LOESS smoother (the blue line in the code below). It doesn’t force a straight line; it follows the data wherever it goes, revealing if there is a curve or a flat spot.

library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.3
# Prepare a small data frame for plotting
# We combine the predictors and the Yield into one table
plot_df <- as.data.frame(trainX)
plot_df$Yield <- trainY

# Plot BiologicalMaterial12 (The 'Hidden Gem')
p1 <- ggplot(plot_df, aes(x = BiologicalMaterial12, y = Yield)) +
  geom_point(alpha = 0.4) + # Points are slightly see-through to show density
  geom_smooth(method = "loess", color = "blue", se = TRUE) + # The 'curve-finder'
  theme_minimal() +
  labs(title = "SVM Unique: BiologicalMaterial12",
       x = "BiologicalMaterial12 Value",
       y = "Yield")

# Plot BiologicalMaterial06 (The 'Rank Jumper')
p2 <- ggplot(plot_df, aes(x = BiologicalMaterial06, y = Yield)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", color = "red", se = TRUE) +
  theme_minimal() +
  labs(title = "SVM Rank Jumper: BiologicalMaterial06",
       x = "BiologicalMaterial06 Value",
       y = "Yield")

# Display them side-by-side
grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

BiologicalMaterial12:

A linear model tries to draw one straight diagonal line through this. Because the line goes up then down, the linear model sees an average of zero and thinks the variable is useless. The SVM sees the switch and uses it to make better predictions.

BiologicalMaterial06:

The linear model only sees that the data generally goes up from left to right, so it completely misses the fact that excessive amounts are actually harmful.

The plots reveal that these predictors exhibit distinct threshold effects and optimal ranges where yield eventually saturates or declines at high values—non-linear complexities that the SVM model identifies and exploits while the linear model’s straight-line assumption remains blind to them.