0.1 Introduction

In this assignment, I explored nonlinear regression models using two different datasets. First, I worked with Friedman’s simulated benchmark data (Exercise 7.2), which challenges models with nonlinear patterns where only some predictors matter. Then, I pivoted to a real-world chemical manufacturing dataset (Exercise 7.5), where the goal was to uncover complex relationships between process variables and product yield.

For both, I compared several nonlinear approaches—KNN, neural nets, MARS, SVM—evaluating them with RMSE and R² on test sets. Along the way, I examined variable importance and visualized the influence of top predictors to get insight beyond just the numbers.

1 Data Simulation and Visualization

This assignment starts with the Friedman1 benchmark dataset, a simulated nonlinear regression problem. It’s got 10 predictors, but only the first five (X1 to X5) are informative. My goal here is to see how well different models can pick up on that pattern—and more importantly, which ones generalize well to unseen data.
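
For reference, mlbench.friedman1 generates the response from Friedman's benchmark equation, in which X6 through X10 play no role at all:

y = 10*sin(pi*x1*x2) + 20*(x3 - 0.5)^2 + 10*x4 + 5*x5 + e,   e ~ N(0, sd^2)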

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)

featurePlot(trainingData$x[, 1:5], trainingData$y, plot = "scatter", layout = c(3,2))

2 K-Nearest Neighbors Model

KNN is simple and intuitive: it predicts from the responses of the nearest training points. After tuning k, performance was decent, but I expected it to struggle as the underlying signal gets more complex.

set.seed(0)
knnModel <- train(x = trainingData$x, y = trainingData$y, method = "knn",
                  preProc = c("center", "scale"), tuneLength = 10)
knnPred <- predict(knnModel, newdata = testData$x)
knnPR <- postResample(pred = knnPred, obs = testData$y)
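
For intuition about what KNN is doing, a prediction is just the average response of the k closest (centered and scaled) training points. A minimal base-R sketch of that idea, with k = 5 chosen arbitrarily (illustrative only, not caret's internals):

knn_one <- function(x0, X, y, k = 5) {
  X <- scale(X)                                   # center/scale as in the caret fit
  x0 <- (x0 - attr(X, "scaled:center")) / attr(X, "scaled:scale")
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))           # Euclidean distance to each training row
  mean(y[order(d)[1:k]])                          # average y of the k nearest neighbors
}
knn_one(as.numeric(testData$x[1, ]), as.matrix(trainingData$x), trainingData$y)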

3 Neural Network Model

Here I’m testing a basic neural network using the nnet package. I expected it to do better than KNN—but it’s known to be sensitive to tuning and initialization.

nnGrid <- expand.grid(.decay = c(0, 0.01, 0.1), .size = 1:10)
set.seed(0)
nnetModel <- train(x = trainingData$x, y = trainingData$y, method = "nnet",
                   preProc = c("center", "scale"), linout = TRUE, trace = FALSE,
                   tuneGrid = nnGrid,
                   # weight cap sized for the largest candidate network (size = 10)
                   MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1, maxit = 500)
nnetPred <- predict(nnetModel, newdata = testData$x)
nnetPR <- postResample(pred = nnetPred, obs = testData$y)

4 Averaged Neural Network Model

This one is essentially an ensemble version of the neural net: it fits several networks from different random starting weights and averages their predictions, which usually reduces variance and improves generalization. Let's see if that holds up.

set.seed(0)
avNNetModel <- train(x = trainingData$x, y = trainingData$y, method = "avNNet",
                     preProc = c("center", "scale"), linout = TRUE, trace = FALSE,
                     MaxNWts = 10 * (ncol(trainingData$x)+1) + 10 + 1, maxit = 500)
avNNetPred <- predict(avNNetModel, newdata = testData$x)
avNNetPR <- postResample(pred = avNNetPred, obs = testData$y)
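
Conceptually, avNNet just refits the same network from several random weight initializations and averages the predictions. A hand-rolled sketch with the nnet package (size and decay fixed arbitrarily here for illustration):

library(nnet)
set.seed(0)
fits <- lapply(1:5, function(i)
  nnet(scale(trainingData$x), trainingData$y, size = 5, decay = 0.01,
       linout = TRUE, trace = FALSE, maxit = 500))
# Scale the test predictors with the training means/SDs, then average the five nets
x_scaled <- scale(testData$x, center = colMeans(trainingData$x),
                  scale = apply(trainingData$x, 2, sd))
avg_pred <- rowMeans(sapply(fits, predict, newdata = x_scaled))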

5 MARS Model

MARS (Multivariate Adaptive Regression Splines) builds flexible piecewise linear models. It’s interpretable and usually does well with nonlinear data—especially when you want to see variable importance.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(0)
marsModel <- train(x = trainingData$x, y = trainingData$y, method = "earth",
                   preProc = c("center", "scale"), tuneGrid = marsGrid)
marsPred <- predict(marsModel, newdata = testData$x)
marsPR <- postResample(pred = marsPred, obs = testData$y)
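
The building block behind MARS is the hinge function max(0, x - c): zero on one side of a knot c, linear on the other, with pairs of mirrored hinges producing the piecewise-linear fit. A tiny illustration of one hypothetical two-hinge term:

hinge <- function(x, knot) pmax(0, x - knot)   # MARS basis function
curve(3 * hinge(x, 0.4) - 2 * pmax(0, 0.4 - x), from = 0, to = 1,
      ylab = "f(x)")                           # kinked at the knot, linear elsewhere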

6 Support Vector Machine Model

SVMs with RBF kernels are powerful when the relationship between predictors and response is nonlinear and smooth. I expected this one to perform competitively.

set.seed(0)
svmModel <- train(x = trainingData$x, y = trainingData$y, method = "svmRadial",
                  preProc = c("center", "scale"), tuneLength = 20)
svmPred <- predict(svmModel, newdata = testData$x)
svmPR <- postResample(pred = svmPred, obs = testData$y)
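
In kernlab's parameterization, the radial basis kernel caret is tuning here is k(x, x') = exp(-sigma * ||x - x'||^2). A quick sketch of the kernel similarity between two (raw, unscaled) training rows, using the sigma the tuning selected:

library(kernlab)
sigma <- svmModel$bestTune$sigma
x1 <- as.numeric(trainingData$x[1, ]); x2 <- as.numeric(trainingData$x[2, ])
exp(-sigma * sum((x1 - x2)^2))   # kernel value computed by hand
rbf <- rbfdot(sigma = sigma)     # same thing via kernlab's kernel object
rbf(x1, x2)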

7 Model Performance Comparison

Now it’s time to see who really understood the assignment. I pulled all the RMSE and R² values from the test set and lined them up to compare performance.

rmses <- c(knnPR[1], nnetPR[1], avNNetPR[1], marsPR[1], svmPR[1])
r2s <- c(knnPR[2], nnetPR[2], avNNetPR[2], marsPR[2], svmPR[2])
methods <- c("KNN", "NN", "AvgNN", "MARS", "SVM")

results <- data.frame(Model = methods, RMSE = rmses, R2 = r2s) %>%
  arrange(desc(RMSE)) %>%
  gt() %>%
  tab_header(title = "Test Set Performance Metrics")

results
Test Set Performance Metrics

Model    RMSE       R2
KNN      3.204059   0.6819919
NN       2.649316   0.7177210
SVM      2.059719   0.8279547
AvgNN    2.055993   0.8323657
MARS     1.322734   0.9291489
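
As a sanity check on what postResample is reporting: RMSE is the root mean squared prediction error, and the R² caret reports is the squared correlation between predictions and observations. Recomputing both by hand for the MARS model should reproduce marsPR:

res <- testData$y - as.numeric(marsPred)
sqrt(mean(res^2))                          # RMSE, matches marsPR[1]
cor(as.numeric(marsPred), testData$y)^2    # R^2, matches marsPR[2]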

8 Variable Importance from MARS

MARS gives us a peek behind the curtain: which predictors mattered most. This helps confirm if the model is focusing on the real drivers (X1–X5) or just noise.

varImp(marsModel)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00
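
As a cross-check, earth's own evimp() reports importance via its nsubsets, GCV, and RSS criteria and should agree with the ranking above:

evimp(marsModel$finalModel)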

9 Interpretation

This was a great exercise in comparing nonlinear models side-by-side. MARS was the clear standout: it posted the lowest test RMSE (1.32) and the highest R² (0.93), and its importance scores zeroed in on the truly informative predictors. SVM and the averaged neural network were nearly tied for second, while the single neural net and especially KNN lagged behind.

Overall, I'd reach for MARS here, since it delivered both the best accuracy and the most insight, with SVM or AvgNN as strong alternatives. This helped me see just how much model choice (and tuning) matters in predictive performance.

10 Exercise 7.5 - Nonlinear Models

10.1 A. Nonlinear Model Comparison

data(ChemicalManufacturingProcess)

processPredictors <- ChemicalManufacturingProcess[, 2:58]
yield <- ChemicalManufacturingProcess[, 1]

# Impute missing values
replacements <- sapply(processPredictors, median, na.rm = TRUE)
for (ci in 1:ncol(processPredictors)) {
  na_index <- is.na(processPredictors[, ci])
  processPredictors[na_index, ci] <- replacements[ci]
}
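
The same median imputation can be delegated to caret's preProcess; a sketch that should produce an equivalent result to the loop above:

imp <- preProcess(ChemicalManufacturingProcess[, 2:58], method = "medianImpute")
processPredictors_alt <- predict(imp, ChemicalManufacturingProcess[, 2:58])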

# Remove zero-variance columns
zero_cols <- nearZeroVar(processPredictors)
processPredictors <- processPredictors[, -zero_cols]

# Train/test split
set.seed(0)
trainIndex <- createDataPartition(yield, p = 0.8, list = FALSE)
trainX <- processPredictors[trainIndex, ]
trainY <- yield[trainIndex]
testX <- processPredictors[-trainIndex, ]
testY <- yield[-trainIndex]
set.seed(0)
ctrl <- trainControl(method = "boot")   # bootstrap resampling, reused by every model below
preProc <- c("center", "scale")

# KNN
knnModel <- train(trainX, trainY, method = "knn", preProc = preProc,
                  trControl = ctrl, tuneLength = 10)
knn_perf <- postResample(predict(knnModel, testX), testY)

# Neural Net
nnetGrid <- expand.grid(size = 1:10, decay = c(0, 0.01, 0.1))
nnetModel <- train(trainX, trainY, method = "nnet", preProc = preProc, linout = TRUE,
                   trace = FALSE, trControl = ctrl, tuneGrid = nnetGrid, maxit = 500)
nnet_perf <- postResample(predict(nnetModel, testX), testY)

# Averaged Neural Net
avNNetModel <- train(trainX, trainY, method = "avNNet", preProc = preProc, linout = TRUE,
                     trace = FALSE, trControl = ctrl, maxit = 500)
avnnet_perf <- postResample(predict(avNNetModel, testX), testY)

# MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
marsModel <- train(trainX, trainY, method = "earth", preProc = preProc,
                   trControl = ctrl, tuneGrid = marsGrid)
mars_perf <- postResample(predict(marsModel, testX), testY)

# SVM Radial
svmModel <- train(trainX, trainY, method = "svmRadial", preProc = preProc,
                  trControl = ctrl, tuneLength = 20)
svm_perf <- postResample(predict(svmModel, testX), testY)

# Compare results
nonlinear_results <- data.frame(
  Model = c("KNN", "Neural Net", "Avg Neural Net", "MARS", "SVM"),
  RMSE = c(knn_perf["RMSE"], nnet_perf["RMSE"],
           avnnet_perf["RMSE"], mars_perf["RMSE"], svm_perf["RMSE"]),
  Rsquared = c(knn_perf["Rsquared"], nnet_perf["Rsquared"],
               avnnet_perf["Rsquared"], mars_perf["Rsquared"], svm_perf["Rsquared"])
)

nonlinear_results %>% gt() %>% tab_header(title = "Test Set Performance - Nonlinear Models")
Test Set Performance - Nonlinear Models

Model            RMSE        Rsquared
KNN              1.0902785   0.604009507
Neural Net       1.7387536   0.002460671
Avg Neural Net   1.2007034   0.541615200
MARS             1.1435413   0.556899258
SVM              0.9141503   0.737100609

10.2 B. Variable Importance

varImp(svmModel)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   84.14
## ManufacturingProcess36   76.06
## BiologicalMaterial06     75.70
## ManufacturingProcess17   72.98
## BiologicalMaterial03     71.70
## BiologicalMaterial12     65.80
## ManufacturingProcess09   64.89
## BiologicalMaterial02     55.11
## ManufacturingProcess06   54.30
## ManufacturingProcess31   47.40
## BiologicalMaterial11     43.85
## BiologicalMaterial04     42.26
## ManufacturingProcess33   40.68
## ManufacturingProcess12   37.19
## ManufacturingProcess11   35.28
## BiologicalMaterial08     35.03
## BiologicalMaterial09     34.55
## BiologicalMaterial01     30.95
## ManufacturingProcess18   27.48
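
Note the header: SVMs expose no model-specific importance, so caret falls back to a model-free filter that fits a loess curve of the response against each predictor and scores it by R². The raw scores behind the scaled 0-100 values above can be produced directly:

filterVarImp(x = trainX, y = trainY, nonpara = TRUE)   # loess-based importance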

10.3 C. Visualize Top Predictor Impact

# Sweep ManufacturingProcess32 over its observed range while holding
# all other predictors at their means
p_range <- range(processPredictors$ManufacturingProcess32)
variation <- seq(from = p_range[1], to = p_range[2], length.out = 100)
mean_predictors <- apply(processPredictors, 2, mean)

# Build a 100-row data frame: every row is the vector of column means
newdata <- as.data.frame(matrix(rep(mean_predictors, each = length(variation)),
                                nrow = length(variation)))
colnames(newdata) <- colnames(processPredictors)
newdata$ManufacturingProcess32 <- variation

y_hat <- predict(svmModel, newdata = newdata)

plot(variation, y_hat, type = "l", lwd = 2, col = "steelblue",
     xlab = "Variation in ManufacturingProcess32", ylab = "Predicted Yield")
grid()
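
What this builds by hand is a one-variable profile: every other predictor is pinned at its mean while ManufacturingProcess32 sweeps its observed range. If you'd rather not roll your own, the pdp package produces a comparable partial-dependence curve (averaging over training rows instead of fixing them at their means); a sketch, assuming pdp is installed:

library(pdp)
partial(svmModel, pred.var = "ManufacturingProcess32", train = trainX, plot = TRUE)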

10.4 Interpretation

SVM was the strongest performer on the test set, with KNN, MARS, and the averaged neural net clustered behind it; the single neural net failed to generalize. The top variables identified by the SVM overlapped with those from elastic net but also revealed nonlinear effects. ManufacturingProcess32 remained influential and had a curvilinear impact on yield, visualized clearly above.

This reinforces the idea that nonlinear models can surface subtleties that linear models overlook—valuable when fine-tuning real-world processes.

10.5 Reflection

Nonlinear models aren’t just buzzwords—they really shine when the data is messy, multidimensional, and packed with interactions. In Exercise 7.2, I saw firsthand how SVM and averaged neural networks consistently outperformed simpler methods like KNN, while MARS gave useful variable insight that confirmed what was truly driving the response.

In Exercise 7.5, working with real chemical manufacturing data reinforced those lessons. SVM stood out again with the strongest test performance, but it was just as important to see that process variables like ManufacturingProcess32 consistently influenced yield, sometimes in subtle nonlinear ways that only models like SVM or MARS could detect.

At the end of the day, this assignment wasn't just about tuning models; it was about learning which tools to use when the problem calls for more than a straight line. Whether I'm working with simulated data or real-world manufacturing processes, nonlinear modeling is now a go-to in my toolkit.