In this assignment, I explored nonlinear regression models using two different datasets. First, I worked with Friedman’s simulated benchmark data (Exercise 7.2), which challenges models with nonlinear patterns where only some predictors matter. Then, I pivoted to a real-world chemical manufacturing dataset (Exercise 7.5), where the goal was to uncover complex relationships between process variables and product yield.
For both, I compared several nonlinear approaches—KNN, neural nets, MARS, SVM—evaluating them with RMSE and R² on test sets. Along the way, I examined variable importance and visualized the influence of top predictors to get insight beyond just the numbers.
This assignment starts with the Friedman1 benchmark dataset, a simulated nonlinear regression problem. It’s got 10 predictors, but only the first five (X1 to X5) are informative. My goal here is to see how well different models can pick up on that pattern—and more importantly, which ones generalize well to unseen data.
library(mlbench)   # mlbench.friedman1()
library(caret)     # featurePlot(), train(), postResample()
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)   # caret works most smoothly with data frames
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# Scatterplots of the five informative predictors against the response
featurePlot(trainingData$x[, 1:5], trainingData$y, plot = "scatter", layout = c(3, 2))
KNN is simple and intuitive: it predicts the average response of the k nearest training points, measured by distance in predictor space. After training it, I noticed decent performance, but I expected it might struggle here, since the five uninformative predictors dilute those distances.
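The KNN chunk itself didn't make it into this write-up, so here is a minimal sketch of the caret call that would produce the knnPR metrics used in the comparison table below. The centered/scaled preprocessing and the tuneLength of 10 are assumptions, not the original settings.
set.seed(0)
knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn",
                  preProc = c("center", "scale"), tuneLength = 10)   # tuneLength assumed
knnPred <- predict(knnModel, newdata = testData$x)
knnPR <- postResample(pred = knnPred, obs = testData$y)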
Here I'm testing a basic neural network using the nnet package. I expected it to do better than KNN, but it's known to be sensitive to tuning and initialization.
# Tuning grid over weight decay and hidden-unit count (.bag belongs to avNNet, not nnet)
nnGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
set.seed(0)
nnetModel <- train(x = trainingData$x, y = trainingData$y, method = "nnet",
preProc = c("center", "scale"), linout = TRUE, trace = FALSE,
tuneGrid = nnGrid,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1, maxit = 500)
nnetPred <- predict(nnetModel, newdata = testData$x)
nnetPR <- postResample(pred = nnetPred, obs = testData$y)
avNNet is the model-averaged version of the neural net: it fits several networks from different random starting weights and averages their predictions. This usually reduces variance and improves generalization. Let's see if that holds up.
set.seed(0)
avNNetModel <- train(x = trainingData$x, y = trainingData$y, method = "avNNet",
preProc = c("center", "scale"), linout = TRUE, trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1, maxit = 500)
avNNetPred <- predict(avNNetModel, newdata = testData$x)
avNNetPR <- postResample(pred = avNNetPred, obs = testData$y)
MARS (Multivariate Adaptive Regression Splines) builds flexible piecewise linear models. It’s interpretable and usually does well with nonlinear data—especially when you want to see variable importance.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(0)
marsModel <- train(x = trainingData$x, y = trainingData$y, method = "earth",
preProc = c("center", "scale"), tuneGrid = marsGrid)
marsPred <- predict(marsModel, newdata = testData$x)
marsPR <- postResample(pred = marsPred, obs = testData$y)
SVMs with RBF kernels are powerful when the relationship between predictors and response is nonlinear and smooth. I expected this one to perform competitively.
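The SVM chunk isn't shown either, so here's a sketch of a radial-kernel fit consistent with the svmPR metrics referenced below. The tuneLength of 20 mirrors the Exercise 7.5 call and is an assumption here.
set.seed(0)
svmModel <- train(x = trainingData$x, y = trainingData$y, method = "svmRadial",
                  preProc = c("center", "scale"), tuneLength = 20)   # tuneLength assumed
svmPred <- predict(svmModel, newdata = testData$x)
svmPR <- postResample(pred = svmPred, obs = testData$y)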
Now it’s time to see who really understood the assignment. I pulled all the RMSE and R² values from the test set and lined them up to compare performance.
library(dplyr)   # arrange(), %>%
library(gt)      # formatted results tables
rmses <- c(knnPR[1], nnetPR[1], avNNetPR[1], marsPR[1], svmPR[1])
r2s <- c(knnPR[2], nnetPR[2], avNNetPR[2], marsPR[2], svmPR[2])
methods <- c("KNN", "NN", "AvgNN", "MARS", "SVM")
results <- data.frame(Model = methods, RMSE = rmses, R2 = r2s) %>%
arrange(desc(RMSE)) %>%
gt() %>%
tab_header(title = "Test Set Performance Metrics")
results
Test Set Performance Metrics

| Model | RMSE     | R2        |
|-------|----------|-----------|
| KNN   | 3.204059 | 0.6819919 |
| NN    | 2.649316 | 0.7177210 |
| SVM   | 2.059719 | 0.8279547 |
| AvgNN | 2.055993 | 0.8323657 |
| MARS  | 1.322734 | 0.9291489 |
MARS gives us a peek behind the curtain: which predictors mattered most. This helps confirm if the model is focusing on the real drivers (X1–X5) or just noise.
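The scores below presumably come from caret's varImp() applied to the tuned MARS fit; the call isn't shown in the original output, but it would simply be:
varImp(marsModel)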
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
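For a quick visual check (not in the original output), the same scores can be plotted with caret's plot method for varImp objects:
plot(varImp(marsModel), top = 5)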
This was a great exercise in comparing nonlinear models side by side. Here's what stood out to me:

- MARS was the clear winner on the test set (RMSE 1.32, R² 0.93) and also surfaced the informative predictors.
- SVM and the averaged neural net were nearly tied just behind it (RMSE around 2.06, R² around 0.83), well ahead of the single neural net.
- KNN trailed the field (RMSE 3.20, R² 0.68); its distance-based predictions get diluted by the five uninformative predictors.

Overall, I'd reach for MARS here for both accuracy and insight, with SVM or AvgNN as strong backups. This helped me see just how much model choice (and tuning) matters in predictive performance.
Next up is Exercise 7.5: the chemical manufacturing data. I load the data, impute missing values with column medians, and drop near-zero-variance predictors before splitting into training and test sets.
library(AppliedPredictiveModeling)   # provides ChemicalManufacturingProcess
data(ChemicalManufacturingProcess)
processPredictors <- ChemicalManufacturingProcess[, 2:58]   # 57 predictors
yield <- ChemicalManufacturingProcess[, 1]                  # response: product yield
# Impute missing values with each column's median
replacements <- sapply(processPredictors, median, na.rm = TRUE)
for (ci in 1:ncol(processPredictors)) {
na_index <- is.na(processPredictors[, ci])
processPredictors[na_index, ci] <- replacements[ci]
}
# Remove near-zero-variance columns (guarding against an empty index)
zero_cols <- nearZeroVar(processPredictors)
if (length(zero_cols) > 0) {
processPredictors <- processPredictors[, -zero_cols]
}
# Train/test split
set.seed(0)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[trainIndex, ]
trainY <- yield[trainIndex]
testX <- processPredictors[-trainIndex, ]
testY <- yield[-trainIndex]
With the data prepared, I fit the same five nonlinear models, tuning each with bootstrap resampling and scoring them on the held-out test set.
set.seed(0)
# Bootstrap resampling (caret's default) and common preprocessing, reused by every model below
ctrl <- trainControl(method = "boot")
preProc <- c("center", "scale")
# KNN
knnModel <- train(trainX, trainY, method = "knn", preProc = preProc,
                  trControl = ctrl, tuneLength = 10)
knn_perf <- postResample(predict(knnModel, testX), testY)
# Neural Net
nnetGrid <- expand.grid(size = 1:10, decay = c(0, 0.01, 0.1))
nnetModel <- train(trainX, trainY, method = "nnet", preProc = preProc, linout = TRUE, trace = FALSE,
                   trControl = ctrl, tuneGrid = nnetGrid, maxit = 500)
nnet_perf <- postResample(predict(nnetModel, testX), testY)
# Averaged Neural Net
avNNetModel <- train(trainX, trainY, method = "avNNet", preProc = preProc, linout = TRUE, trace = FALSE,
                     trControl = ctrl, maxit = 500)
avnnet_perf <- postResample(predict(avNNetModel, testX), testY)
# MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(trainX, trainY, method = "earth", preProc = preProc,
                   trControl = ctrl, tuneGrid = marsGrid)
mars_perf <- postResample(predict(marsModel, testX), testY)
# SVM Radial
svmModel <- train(trainX, trainY, method = "svmRadial", preProc = preProc,
                  trControl = ctrl, tuneLength = 20)
svm_perf <- postResample(predict(svmModel, testX), testY)
# Compare results
nonlinear_results <- data.frame(
Model = c("KNN", "Neural Net", "Avg Neural Net", "MARS", "SVM"),
RMSE = c(knn_perf["RMSE"], nnet_perf["RMSE"],
avnnet_perf["RMSE"], mars_perf["RMSE"], svm_perf["RMSE"]),
Rsquared = c(knn_perf["Rsquared"], nnet_perf["Rsquared"],
avnnet_perf["Rsquared"], mars_perf["Rsquared"], svm_perf["Rsquared"])
)
nonlinear_results %>% gt() %>% tab_header(title = "Test Set Performance - Nonlinear Models")
Test Set Performance - Nonlinear Models

| Model          | RMSE      | Rsquared    |
|----------------|-----------|-------------|
| KNN            | 1.0902785 | 0.604009507 |
| Neural Net     | 1.7387536 | 0.002460671 |
| Avg Neural Net | 1.2007034 | 0.541615200 |
| MARS           | 1.1435413 | 0.556899258 |
| SVM            | 0.9141503 | 0.737100609 |
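The importance listing below isn't preceded by its call in the original output; it most likely comes from caret's varImp() on the SVM fit, which for kernel models falls back to a model-free loess R² filter (hence the header):
varImp(svmModel)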
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 84.14
## ManufacturingProcess36 76.06
## BiologicalMaterial06 75.70
## ManufacturingProcess17 72.98
## BiologicalMaterial03 71.70
## BiologicalMaterial12 65.80
## ManufacturingProcess09 64.89
## BiologicalMaterial02 55.11
## ManufacturingProcess06 54.30
## ManufacturingProcess31 47.40
## BiologicalMaterial11 43.85
## BiologicalMaterial04 42.26
## ManufacturingProcess33 40.68
## ManufacturingProcess12 37.19
## ManufacturingProcess11 35.28
## BiologicalMaterial08 35.03
## BiologicalMaterial09 34.55
## BiologicalMaterial01 30.95
## ManufacturingProcess18 27.48
To see how the top predictor shapes the SVM's predictions, I varied ManufacturingProcess32 across its observed range while holding every other predictor at its mean, then plotted the predicted yield.
p_range <- range(processPredictors$ManufacturingProcess32)
variation <- seq(from = p_range[1], to = p_range[2], length.out = 100)
mean_predictors <- apply(processPredictors, 2, mean)
# Build a 100-row data frame with every column held at its mean (an alternative to repmat)
newdata <- as.data.frame(matrix(rep(mean_predictors, each = length(variation)), nrow = length(variation)))
colnames(newdata) <- colnames(processPredictors)
# ... then let only ManufacturingProcess32 vary across its range
newdata$ManufacturingProcess32 <- variation
y_hat <- predict(svmModel, newdata = newdata)
plot(variation, y_hat, type = "l", lwd = 2, col = "steelblue",
xlab = "ManufacturingProcess32", ylab = "Predicted Yield")
grid()
SVM, MARS, and the averaged neural net models performed strongest overall, with SVM giving the best balance of accuracy and consistency. The top variables identified by the SVM overlapped with those from elastic net but also revealed non-linear effects. ManufacturingProcess32 remained influential and had a curvilinear impact on yield—visualized clearly above.
This reinforces the idea that nonlinear models can surface subtleties that linear models overlook—valuable when fine-tuning real-world processes.
Nonlinear models aren't just buzzwords: they really shine when the data is messy, multidimensional, and packed with interactions. In Exercise 7.2, SVM and the averaged neural network consistently outperformed KNN, while MARS delivered both the best test-set accuracy and variable-importance scores that confirmed what was truly driving the response.
In Exercise 7.5, working with real chemical manufacturing data reinforced those lessons. SVM stood out again with the strongest test performance, but it was just as important to see that process variables like ManufacturingProcess32 consistently influenced yield, sometimes in subtle nonlinear ways that only models like SVM or MARS could detect.