Exercise 7.2 is a classic way to see how non-linear models handle noise. The Friedman simulation includes 10 predictors, but only X_1 through X_5 actually influence the outcome y. The rest are just distractions for the model.
Creating a synthetic dataset where I use a mathematical formula to create the output Y based on specific inputs X. Generating random numbers, set.seed(200) Create 200 samples for training to teach the models and 5,000 samples for testing to see if the models actually learned.
The equation only uses X1 through X5. The simulation will give us 10 variables X1 through X10. We make sure if the models in Chapter 7 are smart enough to ignore the 5 noise variables.
Prepare Data
Recreating the training and test sets
# Load the tools
library(mlbench)
library(caret)
# Set seed for reproducibility
set.seed(200)
# Generating training data (200 samples)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)
# Generating a large test set (5000 samples) to get a precise error rate
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# 4. Quick look at the data
head(trainingData$x)
## X1 X2 X3 X4 X5 X6 X7
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060 0.8266609
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931 0.6489601
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911 0.1785614
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637 0.5133614
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157 0.6644906
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891 0.3360117
## X8 X9 X10
## 1 0.4214081 0.59111440 0.5886216
## 2 0.8446239 0.92819306 0.7584008
## 3 0.3495908 0.01759542 0.4441185
## 4 0.7970260 0.68986918 0.4450716
## 5 0.9038919 0.39696995 0.5500808
## 6 0.6489177 0.53116033 0.9066182
Using the featurePlot function from the caret package creating a Scatterplot Matrix where every single predictor X1 through X10 is plotted against the outcome Y.
# This creates a grid of scatter plots
# 'x' is our predictors, 'y' is our target
# type = c("p", "smooth") adds a trend line to help see patterns
featurePlot(x = trainingData$x,
y = trainingData$y,
plot = "scatter",
type = c("p", "smooth"),
span = .5,
layout = c(5, 2))
The featurePlot confirms that X1 through X2 are the only informative
variables because they show clear sloped or curved patterns, while X6
through X10 appear as flat, random clouds, proving they are just noise
added to distract the model.
Training the K-Nearest Neighbors (KNN) Model
Now training the model K-Nearest Neighbors (KNN), which looks at the data points in the training set that are geographically closest to the new point you want to predict and averages their values.
# Defining how how we want to test the model (5-fold Cross-Validation for speed)
ctrl <- trainControl(method = "cv", number = 5)
# Train the model
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = ctrl)
# View the results
print(knnModel)
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.310218 0.5640866 2.695817
## 7 3.183598 0.6130231 2.575609
## 9 3.133750 0.6548230 2.519030
## 11 3.135078 0.6653443 2.528709
## 13 3.106943 0.6907083 2.503302
## 15 3.081233 0.7116627 2.465985
## 17 3.100552 0.7236138 2.499389
## 19 3.169291 0.7099245 2.562107
## 21 3.192033 0.7134604 2.586728
## 23 3.216710 0.7143357 2.601841
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 15.
# Check the performance on the Test Set
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.1750657 0.6785946 2.5443169
The shows that k=15 is the optimal value for your model, producing a balanced test RMSE of 3.17 and confirming that the model captures about 68% of the relationship in the data without being too distracted by the noise.
KNN Tuning Plot
This plot shows how the error changes as we add more neighbors. You noticed K=15 was the best since the error is lowest.
library(ggplot2)
ggplot(knnModel) +
theme_bw() +
labs(title="KNN Model: Finding the Optimal K",
x="Number of Neighbors (K)",
y="RMSE (Cross-Validation)")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the caret package.
## Please report the issue at <https://github.com/topepo/caret/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Multivariate Adaptive Regression Splines (MARS)
Now training a MARS model using the earth package using technique that doesn’t try to fit one big curved line to the data. Instead, it breaks the data into segments and fits a piece of a straight line to each segment using hinge functions aka called hockey-stick functions.
By setting degree = 2, we tell MARS to look for cases where two variables interact with each other.
# Define the tuning grid
# 'degree' 1 is additive, 2 includes interactions
# 'nprune' is the maximum number of terms in the final model
marsGrid <- expand.grid(degree = 1:2, nprune = 2:20)
# Train the MARS model
marsModel <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneGrid = marsGrid,
trControl = ctrl) # Using the same 'ctrl' from Step 2
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
# View the best tuning parameters
print(marsModel)
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.245901 0.2787321 3.483428
## 1 3 3.583229 0.4805401 2.881311
## 1 4 2.657913 0.7084960 2.125487
## 1 5 2.335060 0.7744494 1.893765
## 1 6 2.322011 0.7813465 1.833316
## 1 7 1.855381 0.8615696 1.450558
## 1 8 1.764554 0.8750919 1.382519
## 1 9 1.781814 0.8724492 1.364755
## 1 10 1.759272 0.8752761 1.376871
## 1 11 1.729588 0.8800043 1.351833
## 1 12 1.655616 0.8902637 1.281361
## 1 13 1.681933 0.8890678 1.315516
## 1 14 1.681012 0.8898265 1.320421
## 1 15 1.681012 0.8898265 1.320421
## 1 16 1.681012 0.8898265 1.320421
## 1 17 1.681012 0.8898265 1.320421
## 1 18 1.681012 0.8898265 1.320421
## 1 19 1.681012 0.8898265 1.320421
## 1 20 1.681012 0.8898265 1.320421
## 2 2 4.245901 0.2787321 3.483428
## 2 3 3.564742 0.4849456 2.842040
## 2 4 2.713767 0.7010239 2.143635
## 2 5 2.305559 0.7822752 1.864245
## 2 6 2.176343 0.8045383 1.717513
## 2 7 1.865123 0.8608544 1.430603
## 2 8 1.770984 0.8719387 1.403682
## 2 9 1.550917 0.9059240 1.229484
## 2 10 1.436948 0.9186192 1.080197
## 2 11 1.411424 0.9220546 1.086060
## 2 12 1.407339 0.9238775 1.093442
## 2 13 1.369094 0.9295236 1.074004
## 2 14 1.392390 0.9286478 1.100633
## 2 15 1.381626 0.9310271 1.093632
## 2 16 1.397686 0.9289366 1.110832
## 2 17 1.397686 0.9289366 1.110832
## 2 18 1.400270 0.9286182 1.107542
## 2 19 1.400270 0.9286182 1.107542
## 2 20 1.400270 0.9286182 1.107542
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 13 and degree = 2.
# Check test set performance
marsPred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
MARS significantly outperformed KNN by reducing the RMSE from 3.18 to 1.28, as its ability to model second-degree interactions (degree = 2) much more accurately captured the underlying non-linear signal of the Friedman equation.
MARS Tuning Plot
This plot shows two lines since MARS has two knobs (degree and nprune).It proves that allowing for interactions (degree = 2) dropped the error much faster than a simple model.
ggplot(marsModel) +
theme_bw() +
labs(title="MARS Model: Pruning and Interaction Results",
x="Number of Retained Terms (nprune)",
y="RMSE (Cross-Validation)")
Support Vector Machines (SVM)
Training a Support Vector Machine (SVM) using a Radial Basis Function (RBF) kernel trying to find a path that fits the data points. If a point falls inside this path, the model ignores the error; it only focuses on the difficult points that fall outside the Support Vectors.
# Train the SVM model
# 'svmRadial' is the most common non-linear SVM method
svmModel <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"), # Vital for SVM
tuneLength = 10, # Tells R to try 10 combos of Cost and Sigma
trControl = ctrl) # Uses the 5-fold CV we created earlier
# View the best tuning parameters
print(svmModel)
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 160, 160, 160, 160, 160
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.582302 0.7882522 2.049558
## 0.50 2.316851 0.8002566 1.830721
## 1.00 2.145594 0.8200534 1.668402
## 2.00 2.076104 0.8277830 1.620225
## 4.00 2.001859 0.8383882 1.569451
## 8.00 1.982921 0.8421005 1.544838
## 16.00 1.988037 0.8412043 1.544945
## 32.00 1.988037 0.8412043 1.544945
## 64.00 1.988037 0.8412043 1.544945
## 128.00 1.988037 0.8412043 1.544945
##
## Tuning parameter 'sigma' was held constant at a value of 0.06406091
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06406091 and C = 8.
# Predict on the large test set
svmPred <- predict(svmModel, newdata = testData$x)
postResample(pred = svmPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.058684 0.828299 1.562474
While the SVM with its Kernel Trick performed significantly better than the baseline KNN, MARS remains the clear winner with a much lower RMSE, 1.28 vs 2.06, proving that segmenting the data into specific hinge interactions is the most effective way to solve the complex Friedman equation.
SVM Tuning Plot
This plot visualizes the Cost (C) parameter where the error drop as the Cost increases, then level off once the model has found the best path to fit the data.
ggplot(svmModel) +
scale_x_log10() + # SVM costs are often best viewed on a log scale
theme_bw() +
labs(title="SVM Model: Impact of Cost (C) on Performance",
x="Cost (Log Scale)",
y="RMSE (Cross-Validation)")
Neural Networks (Universal Approximator)
Now training a Neural Network which is a series of layers where each layer tries to find a different part of the pattern. It takes your X variables, passes them through hidden units, and then combines them into a final prediction for Y.
library(caret)
library(nnet)
# Faster tuning to avoid long running
nnGrid <- expand.grid(size = c(1, 3, 5),
decay = c(0, 0.01, 0.1))
nnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "nnet",
preProc = c("center", "scale"),
tuneGrid = nnGrid,
linout = TRUE,
trace = FALSE,
maxit = 100,
trControl = ctrl) # Uses the 'ctrl' made earlier in code
# Check the results
nnPred <- predict(nnModel, newdata = testData$x)
postResample(pred = nnPred, obs = testData$y)
## RMSE Rsquared MAE
## 2.6432570 0.7194544 2.0232018
Neural Network Tuning Plot
This plot is a grid. It shows how different combinations of Hidden Units (Size) and Weight Decay work together to find the signal in Friedman data.
ggplot(nnModel) +
theme_bw() +
labs(title="Neural Network: Hidden Units & Weight Decay Optimization",
x="Number of Hidden Units",
y="RMSE (Cross-Validation)")
Goal: We are looking for the lowest point on the line. That point
represents the Optimal model settings.
Interpretation: MARS is the best moddel because its significantly lower RMSE of 1.28 and high R2 of 0.93 prove it successfully captured the underlying non-linear equation and interaction terms, whereas the SVM (2.06), Neural Network (2.92), and KNN (3.17) models were notably less precise.
the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
Data Cleaning
Loading the AppliedPredictiveModeling package, which contains the chemical dataset.
Since the raw data has missing values (NA) and some columns where the numbers never change (Near-Zero Variance), we are using preProcess to handle it. knnImpute fills in the gaps, and nzv removes the dead weight predictors that don’t help the model learn.
library(AppliedPredictiveModeling)
library(caret)
data(ChemicalManufacturingProcess)
# We exclude column 1 (Yield) because we only clean the predictors
cleaning_plan <- preProcess(ChemicalManufacturingProcess[, -1],
method = c("knnImpute", "nzv"))
# Applying the cleaning to the predictors
chemPredictors <- predict(cleaning_plan, ChemicalManufacturingProcess[, -1])
chemYield <- ChemicalManufacturingProcess$Yield
Data Splitting and Control
Using createDataPartition to ensure the distribution of our target variable, Yield, is similar in both sets. We also define a trainControl object, telling R to use 10-fold Cross-Validation for every model we train later, ensuring our performance estimates are reliable.
# Set seed for reproducibility
set.seed(100)
# Creating an index for an 80/20 split
trainIndex <- createDataPartition(chemYield, p = 0.8, list = FALSE)
# Creating the actual training and testing sets
trainX <- chemPredictors[trainIndex, ]
trainY <- chemYield[trainIndex]
testX <- chemPredictors[-trainIndex, ]
testY <- chemYield[-trainIndex]
# Define the 'rules' for training (10-fold Cross-Validation)
ctrl <- trainControl(method = "cv", number = 10)
Visualizing the Target Variable (Yield)
library(ggplot2)
# Creating a histogram of the target variable
ggplot(data.frame(Yield = chemYield), aes(x = Yield)) +
geom_histogram(fill = "steelblue", color = "white", bins = 20) +
theme_minimal() +
labs(title = "Distribution of Chemical Manufacturing Yield",
x = "Yield (%)", y = "Frequency")
Visualizing Missing Data (The “Why” for Imputation)
# Checking for missing values in the raw data (before we imputed)
# This will show you a bar chart of which variables had the most NAs
data(ChemicalManufacturingProcess)
missing_counts <- colSums(is.na(ChemicalManufacturingProcess))
missing_data <- data.frame(Variable = names(missing_counts), NAs = missing_counts)
missing_data <- missing_data[missing_data$NAs > 0, ]
ggplot(missing_data, aes(x = reorder(Variable, NAs), y = NAs)) +
geom_bar(stat = "identity", fill = "salmon") +
coord_flip() +
theme_minimal() +
labs(title = "Variables with Missing Values", x = "Predictor", y = "Number of NAs")
Visualizing Predictor Relationships
# Plotting the first 4 predictors against Yield
featurePlot(x = trainX[, 1:4],
y = trainY,
plot = "scatter",
type = c("p", "smooth"),
layout = c(2, 2))
K-Nearest Neighbors (KNN)
Training our first non-linear model, K-Nearest Neighbors (KNN), on the cleaned chemical data and asking the model to predict the Yield of a batch by looking at the 10-fold cross-validation results to find the best value for K.
# Train the KNN model
# We already cleaned the data in Step 1, so we just run the training
set.seed(100)
knnModel_chem <- train(x = trainX,
y = trainY,
method = "knn",
preProc = c("center", "scale"), # Standard for KNN
tuneLength = 10,
trControl = ctrl)
# View the tuning results
print(knnModel_chem)
## k-Nearest Neighbors
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.371596 0.4680392 1.092167
## 7 1.408345 0.4483731 1.133631
## 9 1.419246 0.4494570 1.154292
## 11 1.418249 0.4469231 1.164150
## 13 1.431516 0.4368412 1.181147
## 15 1.440135 0.4442015 1.185132
## 17 1.450507 0.4469780 1.190494
## 19 1.461363 0.4466034 1.197818
## 21 1.473885 0.4403578 1.194148
## 23 1.475068 0.4489935 1.193789
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Check performance on the test set
knnPred_chem <- predict(knnModel_chem, newdata = testX)
postResample(pred = knnPred_chem, obs = testY)
## RMSE Rsquared MAE
## 1.2255603 0.4635978 0.9771250
library(ggplot2)
# Create the tuning profile plot
ggplot(knnModel_chem) +
theme_bw() +
labs(title = "KNN Tuning Profile: Chemical Manufacturing Yield",
x = "Number of Neighbors (K)",
y = "RMSE (Cross-Validation)") +
geom_point(color = "darkblue", size = 2) +
geom_line(color = "steelblue")
The KNN tuning profile for the chemical manufacturing dataset identifies
k=5 as the optimal neighbor count, producing a test RMSE of 1.23 and
indicating that the model’s accuracy decreases as more neighbors are
included in the prediction.
The upward trend in chart, means that as we increase the number of neighbors making the model smoother or more generalized, the predictions actually get worse.
Multivariate Adaptive Regression Splines (MARS)
MARS doesn’t just look at neighbors; it builds hinges (segmented linear relationships) to fit the data.
By testing degree = 1 (additive) and degree = 2 (interactions), we can see if certain manufacturing steps only matter when combined with specific raw materials.
# Define the tuning grid
# We test different levels of pruning (nprune) and interactions (degree)
marsGrid_chem <- expand.grid(degree = 1:2, nprune = 2:38)
# Train the MARS model
set.seed(100)
marsModel_chem <- train(x = trainX,
y = trainY,
method = "earth",
tuneGrid = marsGrid_chem,
trControl = ctrl)
# View the best tuning results
print(marsModel_chem)
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.410080 0.4643855 1.1074276
## 1 3 1.281137 0.5293133 1.0345230
## 1 4 1.209121 0.5878085 0.9729039
## 1 5 1.225140 0.5762346 0.9793405
## 1 6 1.282601 0.5290458 1.0173925
## 1 7 1.309220 0.5014459 1.0308723
## 1 8 1.329401 0.4980445 1.0842932
## 1 9 1.334423 0.5080099 1.0847531
## 1 10 1.354758 0.4999791 1.1049413
## 1 11 1.363315 0.4984635 1.1140661
## 1 12 1.305802 0.5303189 1.0723186
## 1 13 1.326919 0.5178015 1.0835536
## 1 14 1.331249 0.5329869 1.0872012
## 1 15 1.333421 0.5249447 1.0733957
## 1 16 1.335131 0.5284241 1.0784748
## 1 17 1.335131 0.5284241 1.0784748
## 1 18 1.335131 0.5284241 1.0784748
## 1 19 1.335131 0.5284241 1.0784748
## 1 20 1.335131 0.5284241 1.0784748
## 1 21 1.335131 0.5284241 1.0784748
## 1 22 1.335131 0.5284241 1.0784748
## 1 23 1.335131 0.5284241 1.0784748
## 1 24 1.335131 0.5284241 1.0784748
## 1 25 1.335131 0.5284241 1.0784748
## 1 26 1.335131 0.5284241 1.0784748
## 1 27 1.335131 0.5284241 1.0784748
## 1 28 1.335131 0.5284241 1.0784748
## 1 29 1.335131 0.5284241 1.0784748
## 1 30 1.335131 0.5284241 1.0784748
## 1 31 1.335131 0.5284241 1.0784748
## 1 32 1.335131 0.5284241 1.0784748
## 1 33 1.335131 0.5284241 1.0784748
## 1 34 1.335131 0.5284241 1.0784748
## 1 35 1.335131 0.5284241 1.0784748
## 1 36 1.335131 0.5284241 1.0784748
## 1 37 1.335131 0.5284241 1.0784748
## 1 38 1.335131 0.5284241 1.0784748
## 2 2 1.410080 0.4643855 1.1074276
## 2 3 1.311424 0.5279942 1.0575544
## 2 4 1.225004 0.5866982 0.9912320
## 2 5 1.251020 0.5667185 1.0134387
## 2 6 1.265207 0.5698098 1.0291324
## 2 7 1.311435 0.5499015 1.0810013
## 2 8 1.324169 0.5317795 1.0819529
## 2 9 1.316123 0.5405667 1.0768382
## 2 10 1.346390 0.5296667 1.0936183
## 2 11 1.332605 0.5351786 1.0982778
## 2 12 1.338161 0.5323383 1.0919929
## 2 13 1.357093 0.5358739 1.0901525
## 2 14 1.335572 0.5478569 1.0644462
## 2 15 1.339080 0.5402747 1.0650611
## 2 16 1.370622 0.5232940 1.1067013
## 2 17 1.392308 0.5195618 1.1357506
## 2 18 1.404623 0.5140885 1.1503101
## 2 19 1.431108 0.5118381 1.1823986
## 2 20 1.447531 0.5036852 1.1928017
## 2 21 1.465251 0.5043899 1.1984829
## 2 22 1.492220 0.4981039 1.2287422
## 2 23 1.493332 0.4965699 1.2319614
## 2 24 1.499200 0.4931392 1.2360606
## 2 25 1.499200 0.4931392 1.2360606
## 2 26 1.499200 0.4931392 1.2360606
## 2 27 1.499200 0.4931392 1.2360606
## 2 28 1.499200 0.4931392 1.2360606
## 2 29 1.499200 0.4931392 1.2360606
## 2 30 1.499200 0.4931392 1.2360606
## 2 31 1.499200 0.4931392 1.2360606
## 2 32 1.499200 0.4931392 1.2360606
## 2 33 1.499200 0.4931392 1.2360606
## 2 34 1.499200 0.4931392 1.2360606
## 2 35 1.499200 0.4931392 1.2360606
## 2 36 1.499200 0.4931392 1.2360606
## 2 37 1.499200 0.4931392 1.2360606
## 2 38 1.499200 0.4931392 1.2360606
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 4 and degree = 1.
# Check performance on the test set
marsPred_chem <- predict(marsModel_chem, newdata = testX)
postResample(pred = marsPred_chem, obs = testY)
## RMSE Rsquared MAE
## 1.1657015 0.5172382 0.9320261
MARS achieved a superior RMSE of 1.16 by simplifying the 56 predictors into just four additive hinge terms, demonstrating that a sparse, interpretable model captures the chemical yield more effectively than the denser KNN approach.
# Create the Visualization
# This plot shows the Cross-Validated RMSE for different numbers of terms
ggplot(marsModel_chem) +
theme_bw() +
labs(title = "MARS Tuning Profile: Chemical Manufacturing Yield",
subtitle = "Optimal model selected at nprune = 4 and degree = 1",
x = "Number of Retained Terms (nprune)",
y = "RMSE (Cross-Validation)") +
scale_color_brewer(palette = "Set1")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
The chart proves that a sparse model (few variables) with no
interactions is the most robust way to handle this specific chemical
process. The lowest point on the entire graph is the red dot at nprune =
4. This represents the most accurate version of the model.
Support Vector Machines (SVM)
Training a Support Vector Machine (SVM) with a Radial Basis Function (RBF) Kernel. While KNN looks at neighbors and MARS looks at hinges, SVM tries to find a path that surrounds the data points in a way that minimizes error while staying as flat as possible.
# Train the SVM model
# We use tuneLength = 10 to let caret automatically pick 10 values for C
set.seed(100)
svmModel_chem <- train(x = trainX,
y = trainY,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = ctrl)
# View the tuning results
print(svmModel_chem)
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.465282 0.4608086 1.1861159
## 0.50 1.344515 0.5138932 1.0938197
## 1.00 1.234392 0.5756859 1.0030231
## 2.00 1.190829 0.5847420 0.9497286
## 4.00 1.164055 0.5944899 0.9168254
## 8.00 1.144467 0.6115905 0.9007322
## 16.00 1.138240 0.6167046 0.8951823
## 32.00 1.138240 0.6167046 0.8951823
## 64.00 1.138240 0.6167046 0.8951823
## 128.00 1.138240 0.6167046 0.8951823
##
## Tuning parameter 'sigma' was held constant at a value of 0.01229856
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01229856 and C = 16.
# Check performance on the test set
svmPred_chem <- predict(svmModel_chem, newdata = testX)
postResample(pred = svmPred_chem, obs = testY)
## RMSE Rsquared MAE
## 1.0743186 0.5863003 0.8466954
The SVM model effectively leverages its radial kernel to achieve the lowest test error so far (RMSE = 1.07), outperforming both the MARS and KNN models by more accurately capturing the complex, high-dimensional relationships within the chemical manufacturing data.
library(caret)
library(ggplot2)
# Create the tuning profile plot
# scale_x_log10() is used because SVM costs usually vary exponentially (0.25, 1, 4, 16...)
ggplot(svmModel_chem) +
scale_x_log10() +
theme_bw() +
labs(title = "SVM Tuning Profile: Chemical Manufacturing Yield",
subtitle = "Optimal Cost (C) = 16 with Sigma ≈ 0.012",
x = "Cost (Log Scale)",
y = "RMSE (Cross-Validation)") +
geom_line(color = "darkblue") +
geom_point(color = "darkblue", size = 2)
Neural Networks
Training an Averaged Neural Network (avNNet) that trains one brain, an averaged network trains several (usually 5) starting from different random points and averages their predictions. This makes the model much more stable
# Define a targeted tuning grid
nnGrid <- expand.grid(size = c(1, 3, 5),
decay = c(0.01, 0.1),
bag = FALSE) # 'bag' is for bagging; we'll keep it simple
# Train the Averaged Neural Network
set.seed(100)
nnetModel_chem <- train(x = trainX,
y = trainY,
method = "avNNet",
preProc = c("center", "scale"),
tuneGrid = nnGrid,
linout = TRUE, # Needed for regression
trace = FALSE, # Stops R from printing every iteration
maxit = 500, # Maximum iterations
trControl = ctrl)
# View results
print(nnetModel_chem)
## Model Averaged Neural Network
##
## 144 samples
## 56 predictor
##
## Pre-processing: centered (56), scaled (56)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 130, 130, 130, 130, 130, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.01 1.448551 0.4969369 1.163080
## 1 0.10 1.398942 0.4930722 1.107411
## 3 0.01 1.727227 0.4427842 1.327945
## 3 0.10 1.708530 0.3985569 1.313031
## 5 0.01 1.765199 0.3668670 1.369516
## 5 0.10 1.708666 0.4279471 1.325957
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.1 and bag = FALSE.
# Check test set performance
nnetPred_chem <- predict(nnetModel_chem, newdata = testX)
postResample(pred = nnetPred_chem, obs = testY)
## RMSE Rsquared MAE
## 3.0387059 0.1038209 1.5871214
The SVM model is the optimal choice for the chemical manufacturing dataset because its RMSE of 1.074 signifies superior predictive accuracy, effectively outperforming MARS (1.165), KNN (1.225), and the significantly underperforming Neural Network (3.038).
sion model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
To identify which variables drive the chemical manufacturing process, we calculate the Variable Importance for your optimal model (SVM).
Calculating and Visualizing Variable Importance
Since SVM model used a Radial Basis Function (RBF) kernel, importance is determined by how much the model’s accuracy (RMSE) would drop if a specific predictor were removed or shuffled.
We use the varImp() function from the caret package, which standardizes these scores on a scale of 0 to 100.
# Calculating variable importance for the SVM model
svmImp <- varImp(svmModel_chem)
# Plotting the top 10 most important predictors
plot(svmImp, top = 10, main = "Top 10 Predictors: SVM Model")
# View the raw scores for the top predictors
print(svmImp)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 97.84
## BiologicalMaterial06 82.22
## ManufacturingProcess17 77.27
## BiologicalMaterial03 76.21
## ManufacturingProcess36 70.77
## BiologicalMaterial02 68.79
## ManufacturingProcess09 67.86
## BiologicalMaterial12 63.36
## ManufacturingProcess06 55.15
## BiologicalMaterial04 54.31
## ManufacturingProcess33 49.26
## ManufacturingProcess31 47.73
## ManufacturingProcess11 45.72
## BiologicalMaterial11 42.44
## BiologicalMaterial08 41.89
## ManufacturingProcess29 41.28
## BiologicalMaterial01 41.19
## BiologicalMaterial09 39.70
## ManufacturingProcess02 36.69
SVM model has successfully filtered through the 56 variables to find the specific drivers of manufacturing yield.
Dominance Analysis
Manufacturing Process variables clearly dominate the list
Show 6 Process variables versus 4 Biological materials.
ManufacturingProcess32 is significantly more important than anything else, acting as the primary lever for controlling yield in this non-linear model.
Comparison with the Linear Model (PLS - Partial Least Squares)
we need to see how these results differ from the Optimal Linear Model (for this dataset is usually PLS - Partial Least Squares). Linear models and non-linear models often disagree because non-linear models can see patterns (like curves or spikes) that linear models miss.
# Train the optimal Linear Model (PLS) for comparison
set.seed(100)
plsModel_chem <- train(x = trainX,
y = trainY,
method = "pls",
tuneLength = 20,
trControl = ctrl)
# Calculate and plot importance for the Linear Model
plsImp <- varImp(plsModel_chem)
## Warning: package 'pls' was built under R version 4.3.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
plot(plsImp, top = 10, main = "Top 10 Predictors: Linear (PLS) Model")
# Print the top 10 scores to compare side-by-side
print(plsImp)
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 77.89
## ManufacturingProcess09 76.11
## ManufacturingProcess36 74.57
## ManufacturingProcess17 74.09
## BiologicalMaterial02 60.74
## BiologicalMaterial06 58.95
## ManufacturingProcess33 58.83
## BiologicalMaterial03 55.13
## ManufacturingProcess06 54.88
## ManufacturingProcess12 53.51
## BiologicalMaterial08 53.46
## BiologicalMaterial12 52.65
## BiologicalMaterial11 52.03
## ManufacturingProcess11 49.83
## BiologicalMaterial04 48.30
## BiologicalMaterial01 47.45
## ManufacturingProcess28 45.61
## ManufacturingProcess04 42.22
## ManufacturingProcess15 33.15
The Model Performance (Part A)
Support Vector Machines (SVM) is the optimal model with the lowest error (RMSE 1.07).
MARS performed well (RMSE 1.16) by simplifying the 56 variables into just 4 key hinges.
KNN was decent (RMSE 1.22) but struggled to filter out the noise in the large predictor set.
Neural Networks failed significantly (RMSE 3.03), likely because the dataset (144 samples) is too small for a network to learn effectively without overfitting.
Variable Importance (Part B)
Manufacturing Process variables dominate the top of the list, accounting for 6 of the top 10 predictors.
ManufacturingProcess32 is the main driver of yield in every model we tested.
While process variables lead, BiologicalMaterial06 is a major factor that the non-linear model values much more highly than the linear model.
Linear vs. Non-Linear: The main difference is that the SVM identified complex, curved patterns in biological materials that the Linear (PLS) model simply couldn’t see.
the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
To answer we need to look for the predictors that the SVM (Non-Linear) model found important but the Linear (PLS) model completely overlooked.
The Linear Model (PLS): Only sees things that go up or down in a straight line. If a relationship is curved, this model gets confused and thinks that variable isn’t important.
The Non-Linear Model (SVM): Can see curves, waves, and sudden jumps. It understands complex patterns.
The “Difference” Test
To find out which predictors have those complex curves, we look at the Top 10 lists from both models and see where they disagree.
Looking at BiologicalMaterial12:
In SVM list, it is in the Top 10.
In PLS list, it is not found.
The Logic: If the straight-line model ignored it, but the curve-seeing model keep it, BiologicalMaterial12 must have a curved relationship with yield.
Look at BiologicalMaterial06:
The Linear model (PLS) thought it was okay and ranked #7.
The Non-Linear model (SVM) thought it was critical and ranked #3.
The SVM found extra information in this variable that the linear model missed. It didn’t just see a slight trend; it saw a complex pattern that makes it much more valuable for predicting yield.
Creating the Scatter Plots
We will use ggplot2 to plot Yield against these two specific predictors. The importance is visible by the LOESS smoother (the blue line in the code below). It doesn’t force a straight line; it follows the data wherever it goes, revealing if there is a curve or a flat spot.
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.3
# Prepare a small data frame for plotting
# We combine the predictors and the Yield into one table
plot_df <- as.data.frame(trainX)
plot_df$Yield <- trainY
# Plot BiologicalMaterial12 (The 'Hidden Gem')
p1 <- ggplot(plot_df, aes(x = BiologicalMaterial12, y = Yield)) +
geom_point(alpha = 0.4) + # Points are slightly see-through to show density
geom_smooth(method = "loess", color = "blue", se = TRUE) + # The 'curve-finder'
theme_minimal() +
labs(title = "SVM Unique: BiologicalMaterial12",
x = "BiologicalMaterial12 Value",
y = "Yield")
# Plot BiologicalMaterial06 (The 'Rank Jumper')
p2 <- ggplot(plot_df, aes(x = BiologicalMaterial06, y = Yield)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "loess", color = "red", se = TRUE) +
theme_minimal() +
labs(title = "SVM Rank Jumper: BiologicalMaterial06",
x = "BiologicalMaterial06 Value",
y = "Yield")
# Display them side-by-side
grid.arrange(p1, p2, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
BiologicalMaterial12:
A linear model tries to draw one straight diagonal line through this. Because the line goes up then down, the linear model sees an average of zero and thinks the variable is useless. The SVM sees the switch and uses it to make better predictions.
BiologicalMaterial06:
The linear model only sees that the data generally goes up from left to right, so it completely misses the fact that excessive amounts are actually harmful.
The plots reveal that these predictors exhibit distinct threshold effects and optimal ranges where yield eventually saturates or declines at high values—non-linear complexities that the SVM model identifies and exploits while the linear model’s straight-line assumption remains blind to them.