DATA 624 - HW 8

library(mlbench)
library(tidyverse)
library(caret)
library(kernlab)
library(earth)
library(gridExtra)

Exercise 7.2

Question 1: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:

\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\] where the ‘x’ values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package ‘mlbench’ contains a function called ‘mlbench.friedman1’ that simulates these data. Which models appear to give the best performance?

set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
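To make the simulation formula concrete, here is a minimal base-R sketch that writes it out by hand. The function name `friedman1_sketch` is hypothetical and this is not the package's own implementation; it only assumes the equation stated above, with ten uniform predictors of which the last five are unused.

```r
# Hand-written version of the benchmark equation (illustrative only).
# friedman1_sketch is a hypothetical helper, not part of mlbench.
friedman1_sketch <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), ncol = 10)   # 10 predictors, uniform on [0, 1]
  y <- 10 * sin(pi * x[, 1] * x[, 2]) +   # interaction between x1 and x2
    20 * (x[, 3] - 0.5)^2 +               # quadratic effect of x3
    10 * x[, 4] + 5 * x[, 5] +            # linear effects of x4 and x5
    rnorm(n, sd = sd)                     # noise; columns 6-10 are never used
  list(x = x, y = y)
}

set.seed(1)
sim <- friedman1_sketch(200)
dim(sim$x)      # 200 rows, 10 columns
length(sim$y)   # 200 responses
```

Only columns 1 to 5 enter the response, which is why a model that ignores X6 to X10 is behaving as intended.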

Let’s create a data frame that will store the test-set metrics ‘RMSE’, ‘Rsquared’ and ‘MAE’ for each model.

model_metrics <- data.frame(
  matrix(vector(), 0, 3,
         dimnames = list(c(), c("RMSE", "Rsquared", "MAE"))),
  stringsAsFactors = FALSE
)

(a) Tune several models on these data. For example:

Model 1: K-Nearest Neighbors (KNN):

K-nearest Neighbors Model:

library(caret)
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  3.466085  0.5121775  2.816838
##    7  3.349428  0.5452823  2.727410
##    9  3.264276  0.5785990  2.660026
##   11  3.214216  0.6024244  2.603767
##   13  3.196510  0.6176570  2.591935
##   15  3.184173  0.6305506  2.577482
##   17  3.183130  0.6425367  2.567787
##   19  3.198752  0.6483184  2.592683
##   21  3.188993  0.6611428  2.588787
##   23  3.200458  0.6638353  2.604529
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
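For intuition about what the tuned model does, the sketch below predicts one new point by averaging the responses of its k nearest training points after centering and scaling. The helper `knn_predict_one` and the toy data are hypothetical; this is a simplified illustration, not the code caret runs internally.

```r
# Minimal k-nearest-neighbours regression for a single new point.
# knn_predict_one is a hypothetical helper for illustration.
knn_predict_one <- function(train_x, train_y, new_x, k) {
  mu <- colMeans(train_x)                 # centering values from training data
  s <- apply(train_x, 2, sd)              # scaling values from training data
  z_train <- sweep(sweep(train_x, 2, mu), 2, s, "/")
  z_new <- (new_x - mu) / s
  d <- sqrt(rowSums(sweep(z_train, 2, z_new)^2))  # Euclidean distances
  mean(train_y[order(d)[seq_len(k)]])     # average of the k closest responses
}

set.seed(1)
toy_x <- matrix(runif(40), ncol = 2)      # 20 toy observations, 2 predictors
toy_y <- toy_x[, 1] + toy_x[, 2]
knn_predict_one(toy_x, toy_y, new_x = c(0.5, 0.5), k = 5)
```

Larger values of k average over more neighbours, which smooths the prediction; with k equal to the training-set size the prediction is simply the mean response.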

Prediction Using the KNN Model:

knnPred <- predict(knnModel, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = knnPred, obs = testData$y))

Now that the KNN model has been trained and its performance metrics have been recorded in the model_metrics data frame, we can proceed to train and evaluate additional models using the same approach.
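The three numbers stored for each model can be reproduced by hand. The sketch below assumes the usual definitions for numeric outcomes, with R-squared taken as the squared correlation between predictions and observations; `metrics_sketch` and the toy vectors are hypothetical.

```r
# Hand computation of RMSE, R-squared and MAE (illustrative only).
metrics_sketch <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,         # squared correlation
    MAE = mean(abs(pred - obs)))         # mean absolute error
}

obs <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)
m <- metrics_sketch(pred, obs)
m
```

RMSE penalizes large errors more heavily than MAE, so the two can rank models differently.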

Model 2: Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines Model:

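In this section, we will tune a Multivariate Adaptive Regression Splines (MARS) model. We’ll start by defining a tuneGrid to specify the combinations of model parameters to evaluate: the maximum degree of interaction (degree) and the number of terms retained after pruning (nprune).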

mars_tune_grid <- expand.grid(degree = 1:2, nprune = 2:38)

Now that we’ve defined our tuning grid, we can proceed to train the MARS model using the train() function from the caret package, setting the method to “earth”.

mars_model <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "earth",
  preProcess = c("center", "scale"),
  tuneGrid = mars_tune_grid,
  trControl = trainControl(method = "cv")
)
mars_model

## Multivariate Adaptive Regression Spline 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE      Rsquared   MAE     
##   1        2      4.462296  0.2176253  3.697979
##   1        3      3.720663  0.4673821  2.949121
##   1        4      2.680039  0.7094916  2.123848
##   1        5      2.333538  0.7781559  1.856629
##   1        6      2.367933  0.7754329  1.901509
##   1        7      1.809983  0.8656526  1.414985
##   1        8      1.692656  0.8838936  1.333678
##   1        9      1.704958  0.8845683  1.339517
##   1       10      1.688559  0.8842495  1.309838
##   1       11      1.669043  0.8886165  1.296522
##   1       12      1.645066  0.8892796  1.271981
##   1       13      1.655570  0.8886896  1.271232
##   1       14      1.666354  0.8879143  1.285545
##   1       15      1.666354  0.8879143  1.285545
##   1       16      1.666354  0.8879143  1.285545
##   1       17      1.666354  0.8879143  1.285545
##   1       18      1.666354  0.8879143  1.285545
##   1       19      1.666354  0.8879143  1.285545
##   1       20      1.666354  0.8879143  1.285545
##   1       21      1.666354  0.8879143  1.285545
##   1       22      1.666354  0.8879143  1.285545
##   1       23      1.666354  0.8879143  1.285545
##   1       24      1.666354  0.8879143  1.285545
##   1       25      1.666354  0.8879143  1.285545
##   1       26      1.666354  0.8879143  1.285545
##   1       27      1.666354  0.8879143  1.285545
##   1       28      1.666354  0.8879143  1.285545
##   1       29      1.666354  0.8879143  1.285545
##   1       30      1.666354  0.8879143  1.285545
##   1       31      1.666354  0.8879143  1.285545
##   1       32      1.666354  0.8879143  1.285545
##   1       33      1.666354  0.8879143  1.285545
##   1       34      1.666354  0.8879143  1.285545
##   1       35      1.666354  0.8879143  1.285545
##   1       36      1.666354  0.8879143  1.285545
##   1       37      1.666354  0.8879143  1.285545
##   1       38      1.666354  0.8879143  1.285545
##   2        2      4.440854  0.2204755  3.686796
##   2        3      3.697203  0.4714312  2.938566
##   2        4      2.664266  0.7149235  2.119566
##   2        5      2.313371  0.7837374  1.852172
##   2        6      2.335796  0.7875253  1.841919
##   2        7      1.833248  0.8623489  1.461538
##   2        8      1.695822  0.8883658  1.329030
##   2        9      1.555106  0.9028532  1.221365
##   2       10      1.497805  0.9088251  1.158054
##   2       11      1.419280  0.9207646  1.139722
##   2       12      1.326566  0.9315939  1.066200
##   2       13      1.266877  0.9354482  1.002983
##   2       14      1.256694  0.9349307  1.006273
##   2       15      1.311401  0.9316487  1.039213
##   2       16      1.292299  0.9336915  1.022410
##   2       17      1.304364  0.9321032  1.031643
##   2       18      1.304364  0.9321032  1.031643
##   2       19      1.304364  0.9321032  1.031643
##   2       20      1.304364  0.9321032  1.031643
##   2       21      1.304364  0.9321032  1.031643
##   2       22      1.304364  0.9321032  1.031643
##   2       23      1.304364  0.9321032  1.031643
##   2       24      1.304364  0.9321032  1.031643
##   2       25      1.304364  0.9321032  1.031643
##   2       26      1.304364  0.9321032  1.031643
##   2       27      1.304364  0.9321032  1.031643
##   2       28      1.304364  0.9321032  1.031643
##   2       29      1.304364  0.9321032  1.031643
##   2       30      1.304364  0.9321032  1.031643
##   2       31      1.304364  0.9321032  1.031643
##   2       32      1.304364  0.9321032  1.031643
##   2       33      1.304364  0.9321032  1.031643
##   2       34      1.304364  0.9321032  1.031643
##   2       35      1.304364  0.9321032  1.031643
##   2       36      1.304364  0.9321032  1.031643
##   2       37      1.304364  0.9321032  1.031643
##   2       38      1.304364  0.9321032  1.031643
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

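The lowest cross-validated RMSE (1.257) occurs at degree = 2 and nprune = 14. Allowing more terms does not help: for degree = 2 the RMSE is slightly higher at nprune = 15 to 17 and then stays constant at 1.304, suggesting that larger nprune values no longer change the fitted model. So we can now move forward and make predictions using the MARS model.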
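To illustrate what the degree parameter controls, here is a minimal sketch of the hinge functions that MARS builds its terms from. The helper `hinge`, the knots at 0.5 and the toy vectors are hypothetical choices for illustration, not values taken from the fitted model.

```r
# Hinge functions, the building blocks of MARS terms (illustrative only).
hinge <- function(x, knot, direction = c("right", "left")) {
  direction <- match.arg(direction)
  if (direction == "right") pmax(0, x - knot) else pmax(0, knot - x)
}

x1 <- c(0.1, 0.5, 0.9)
x2 <- c(0.2, 0.6, 0.8)

# A degree = 1 term uses a single predictor
h1 <- hinge(x1, knot = 0.5)                           # 0 0 0.4

# A degree = 2 term is a product of hinges on two predictors,
# which lets the model represent an interaction
h12 <- hinge(x1, knot = 0.5) * hinge(x2, knot = 0.5)  # 0 0 0.12
```

The simulated response contains an interaction between x1 and x2, which is consistent with degree = 2 being selected over degree = 1.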

Prediction Using the MARS Model

Now that we’ve trained and tuned the MARS model, we’ll use it to make predictions on the test data and evaluate its performance.

mars_pred <- predict(mars_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = mars_pred, obs = testData$y))

Now that our MARS model is complete and its performance metrics have been added to the model_metrics data frame, we can proceed to train a Support Vector Machine model.

Model 3: Support Vector Machine (SVM)

Support Vector Machine (SVM) Model:

svm_model <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "svmRadial",  
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv"),
  tuneLength = 10
)
svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE      Rsquared   MAE     
##     0.25  2.475865  0.7988016  1.992183
##     0.50  2.214599  0.8166629  1.774775
##     1.00  2.046869  0.8398392  1.623489
##     2.00  1.953012  0.8519284  1.552263
##     4.00  1.891812  0.8587464  1.523110
##     8.00  1.875668  0.8604860  1.532309
##    16.00  1.879129  0.8595239  1.542180
##    32.00  1.879024  0.8595396  1.542161
##    64.00  1.879024  0.8595396  1.542161
##   128.00  1.879024  0.8595396  1.542161
## 
## Tuning parameter 'sigma' was held constant at a value of 0.06437208
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06437208 and C = 8.
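The sigma reported above is the scale parameter of the radial basis function kernel. As a rough sketch, assuming the parameterization k(x, y) = exp(-sigma * ||x - y||^2), the kernel value between two points can be computed directly; `rbf_kernel` and the two toy points are hypothetical.

```r
# Radial basis function kernel under the assumed parameterization
# k(x, y) = exp(-sigma * ||x - y||^2) (illustrative only).
rbf_kernel <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))

sigma_hat <- 0.06437208   # value reported in the tuning output above
a <- c(0, 0, 0)
b <- c(1, 2, 2)           # squared distance from a is 9

rbf_kernel(a, a, sigma_hat)   # identical points give similarity 1
rbf_kernel(a, b, sigma_hat)   # similarity decays with squared distance
```

With sigma fixed, only the cost parameter C is varied, which is why the tuning table has a single column of candidate values.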

Prediction Using the Support Vector Machines Model:

svm_Pred <- predict(svm_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = svm_Pred, obs = testData$y))

Model 4: Neural Networks

Neural Networks Model:

In this section we define the tuning grid for the neural network model.

nnet_tune_grid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1), bag = FALSE)

The code below sets the maximum number of weights (MaxNWts) allowed by the neural network model, chosen high enough that no network in the tuning grid exceeds it.

nnet_maxnwts <- 10 * (ncol(trainingData$x) + 1) + 5 + 1
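As a check on this limit, the number of weights in a single-hidden-layer network with p predictors, H hidden units and one linear output can be counted as H(p + 1) + H + 1. The sketch below assumes that architecture with no skip-layer connections; `nnet_weight_count` is a hypothetical helper.

```r
# Weight count for a single-hidden-layer network with one output
# (illustrative; assumes no skip-layer connections).
nnet_weight_count <- function(p, size) {
  size * (p + 1) +   # inputs plus bias into each hidden unit
    size + 1         # hidden units plus bias into the output
}

needed <- nnet_weight_count(p = 10, size = 1:5)
needed               # 13 25 37 49 61
```

The largest network in the grid (size = 5) needs 61 weights, so the limit of 116 computed above is a conservative upper bound.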

Now that we’ve defined our tuning grid and calculated MaxNWts, we can proceed to train the neural network model using the train() function from the caret package, setting the method to “avNNet”.

nnet_model <- train(
  x = trainingData$x,
  y = trainingData$y,
  method = "avNNet",
  preProcess = c("center", "scale"),
  tuneGrid = nnet_tune_grid,
  trControl = trainControl(method = "cv"),
  linout = TRUE,
  trace = FALSE,
  MaxNWts = nnet_maxnwts,
  maxit = 500
)
nnet_model

## Model Averaged Neural Network 
## 
## 200 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared   MAE     
##   1     0.00   2.501962  0.7484271  1.974938
##   1     0.01   2.439901  0.7566477  1.904959
##   1     0.10   2.444509  0.7553394  1.906025
##   2     0.00   2.438279  0.7534711  1.916776
##   2     0.01   2.477303  0.7517684  1.948808
##   2     0.10   2.509162  0.7434646  1.952322
##   3     0.00   2.036144  0.8297174  1.599508
##   3     0.01   2.144052  0.8187443  1.711054
##   3     0.10   2.169155  0.8134280  1.724083
##   4     0.00   1.923496  0.8501136  1.558840
##   4     0.01   2.002755  0.8427114  1.574649
##   4     0.10   2.127435  0.8242030  1.699589
##   5     0.00   2.279592  0.7874826  1.736309
##   5     0.01   2.134616  0.8147876  1.706574
##   5     0.10   2.257318  0.7994550  1.799377
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.

Prediction Using the Neural Networks Model

Now that our neural network model is trained and tuned, we use it to make predictions on the test data and add its performance metrics to our tracking data frame for comparison with the other models.

nnet_Pred <- predict(nnet_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = nnet_Pred, obs = testData$y))

Model Comparison:

colnames(model_metrics) <- c("RMSE","Rsquared","MAE")
# Names must follow the order in which rows were added: KNN, MARS, SVM, Neural Network
model_metrics$NAME <- c("KNN", "MARS", "SVM", "Neural Network")
model_metrics <- model_metrics %>% relocate("NAME") %>% arrange(RMSE)

knitr::kable(model_metrics, caption = "Performance Comparison of Models")

Performance Comparison of Models
NAME	RMSE	Rsquared	MAE
MARS	1.277999	0.9338365	1.014707
Neural Network	1.961986	0.8481771	1.492773
SVM	2.059968	0.8280925	1.563584
KNN	3.204060	0.6819919	2.568346

From the table above, we can see that the MARS model had the best test-set performance, with the lowest RMSE (1.278) and MAE (1.015) and the highest \(R^2\) (0.934).

(b) Does MARS select the informative predictors (those named X1–X5)?

mars_vi <- varImp(mars_model)
mars_vi

## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.40
## X2   49.00
## X5   15.72
## X3    0.00

Yes. The importance listing contains only the informative predictors X1 to X5, which aligns with the structure of the simulated data. X1 was the most influential (100.00), followed by X4 (75.40) and X2 (49.00). X5 contributed much less (15.72), and X3 has a scaled importance of 0.00, the lowest of the five. The remaining variables (X6–X10) do not appear in the listing, which suggests that MARS excluded the non-informative predictors as expected.

Exercise 7.5

Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Load, Preprocess, and Split the Data:

library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
str(ChemicalManufacturingProcess)

## 'data.frame':    176 obs. of  58 variables:
##  $ Yield                 : num  38 42.4 42 41.4 42.5 ...
##  $ BiologicalMaterial01  : num  6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
##  $ BiologicalMaterial02  : num  49.6 61 61 61 63.3 ...
##  $ BiologicalMaterial03  : num  57 67.5 67.5 67.5 72.2 ...
##  $ BiologicalMaterial04  : num  12.7 14.6 14.6 14.6 14 ...
##  $ BiologicalMaterial05  : num  19.5 19.4 19.4 19.4 17.9 ...
##  $ BiologicalMaterial06  : num  43.7 53.1 53.1 53.1 54.7 ...
##  $ BiologicalMaterial07  : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ BiologicalMaterial08  : num  16.7 19 19 19 18.2 ...
##  $ BiologicalMaterial09  : num  11.4 12.6 12.6 12.6 12.8 ...
##  $ BiologicalMaterial10  : num  3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
##  $ BiologicalMaterial11  : num  138 154 154 154 148 ...
##  $ BiologicalMaterial12  : num  18.8 21.1 21.1 21.1 21.1 ...
##  $ ManufacturingProcess01: num  NA 0 0 0 10.7 12 11.5 12 12 12 ...
##  $ ManufacturingProcess02: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess03: num  NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
##  $ ManufacturingProcess04: num  NA 917 912 911 918 924 933 929 928 938 ...
##  $ ManufacturingProcess05: num  NA 1032 1004 1015 1028 ...
##  $ ManufacturingProcess06: num  NA 210 207 213 206 ...
##  $ ManufacturingProcess07: num  NA 177 178 177 178 178 177 178 177 177 ...
##  $ ManufacturingProcess08: num  NA 178 178 177 178 178 178 178 177 177 ...
##  $ ManufacturingProcess09: num  43 46.6 45.1 44.9 45 ...
##  $ ManufacturingProcess10: num  NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
##  $ ManufacturingProcess11: num  NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
##  $ ManufacturingProcess12: num  NA 0 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess13: num  35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
##  $ ManufacturingProcess14: num  4898 4869 4878 4897 4992 ...
##  $ ManufacturingProcess15: num  6108 6095 6087 6102 6233 ...
##  $ ManufacturingProcess16: num  4682 4617 4617 4635 4733 ...
##  $ ManufacturingProcess17: num  35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
##  $ ManufacturingProcess18: num  4865 4867 4877 4872 4886 ...
##  $ ManufacturingProcess19: num  6049 6097 6078 6073 6102 ...
##  $ ManufacturingProcess20: num  4665 4621 4621 4611 4659 ...
##  $ ManufacturingProcess21: num  0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
##  $ ManufacturingProcess22: num  NA 3 4 5 8 9 1 2 3 4 ...
##  $ ManufacturingProcess23: num  NA 0 1 2 4 1 1 2 3 1 ...
##  $ ManufacturingProcess24: num  NA 3 4 5 18 1 1 2 3 4 ...
##  $ ManufacturingProcess25: num  4873 4869 4897 4892 4930 ...
##  $ ManufacturingProcess26: num  6074 6107 6116 6111 6151 ...
##  $ ManufacturingProcess27: num  4685 4630 4637 4630 4684 ...
##  $ ManufacturingProcess28: num  10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
##  $ ManufacturingProcess29: num  21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
##  $ ManufacturingProcess30: num  9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
##  $ ManufacturingProcess31: num  69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
##  $ ManufacturingProcess32: num  156 169 173 171 171 173 159 161 160 164 ...
##  $ ManufacturingProcess33: num  66 66 66 68 70 70 65 65 65 66 ...
##  $ ManufacturingProcess34: num  2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ ManufacturingProcess35: num  486 508 509 496 468 490 475 478 491 488 ...
##  $ ManufacturingProcess36: num  0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
##  $ ManufacturingProcess37: num  0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
##  $ ManufacturingProcess38: num  3 2 2 2 2 2 2 2 3 3 ...
##  $ ManufacturingProcess39: num  7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
##  $ ManufacturingProcess40: num  NA 0.1 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess41: num  NA 0.15 0 0 0 0 0 0 0 0 ...
##  $ ManufacturingProcess42: num  11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
##  $ ManufacturingProcess43: num  3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
##  $ ManufacturingProcess44: num  1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
##  $ ManufacturingProcess45: num  2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...

prep_obj <- preProcess(ChemicalManufacturingProcess, 
                       method = c("BoxCox", "knnImpute", "center", "scale")) # Preprocessing

processed_data <- predict(prep_obj, ChemicalManufacturingProcess)
processed_data$Yield <- ChemicalManufacturingProcess$Yield # Add back the target variable (not transformed or imputed)


set.seed(123) 
ind <- sample(seq_len(nrow(processed_data)), size = floor(0.85 * nrow(processed_data))) # Split the data into training (85%) and testing (15%)
train <- processed_data[ind, ]
test <- processed_data[-ind, ]
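For a sense of scale, the split sizes implied by the code above can be worked out directly from the 176 observations reported by str(). This is a small arithmetic sketch; the variable names are illustrative.

```r
# Split sizes implied by an 85/15 split of 176 observations
n <- 176
n_train <- floor(0.85 * n)   # 149 training rows
n_test <- n - n_train        # 27 test rows
c(train = n_train, test = n_test)
```

With only 27 test observations, test-set metrics are fairly noisy, so small differences between models should be read with caution.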

Create Models:

# Set up resampling control
control <- trainControl(method = "cv")

# Create an empty results tracker
model_metrics <- data.frame(NAME = character(),
                            RMSE = numeric(),
                            Rsquared = numeric(),
                            MAE = numeric(),
                            stringsAsFactors = FALSE)

# 1. KNN
knn_model <- train(Yield ~ ., data = train,
                   method = "knn",
                   preProcess = c("center", "scale"),
                   trControl = control,
                   tuneLength = 10)
knn_pred <- predict(knn_model, newdata = test)
knn_perf <- postResample(pred = knn_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
                       data.frame(NAME = "KNN", t(knn_perf)))  # keeps the metrics numeric

# 2. MARS
mars_grid <- expand.grid(degree = 1:2, nprune = 2:20)
mars_model <- train(Yield ~ ., data = train,
                    method = "earth",
                    preProcess = c("center", "scale"),
                    tuneGrid = mars_grid,
                    trControl = control)
mars_pred <- predict(mars_model, newdata = test)
mars_perf <- postResample(pred = mars_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
                       data.frame(NAME = "MARS", t(mars_perf)))  # keeps the metrics numeric

# 3. SVM
svm_model <- train(Yield ~ ., data = train,
                   method = "svmRadial",
                   preProcess = c("center", "scale"),
                   trControl = control,
                   tuneLength = 10)
svm_pred <- predict(svm_model, newdata = test)
svm_perf <- postResample(pred = svm_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
                       data.frame(NAME = "SVM", t(svm_perf)))  # keeps the metrics numeric

# 4. Neural Network
nnet_grid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1),bag = FALSE)
nnet_maxnwts <- 5 * (ncol(train) - 1 + 1) + 5 + 1
nnet_model <- train(Yield ~ ., data = train,
                    method = "avNNet",
                    preProcess = c("center", "scale"),
                    tuneGrid = nnet_grid,
                    trControl = control,
                    linout = TRUE,
                    trace = FALSE,
                    MaxNWts = nnet_maxnwts,
                    maxit = 500)
nnet_pred <- predict(nnet_model, newdata = test)
nnet_perf <- postResample(pred = nnet_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
                       data.frame(NAME = "Neural Network", t(nnet_perf)))  # keeps the metrics numeric

(a) Which nonlinear regression model gives the optimal resampling and test set performance?

To evaluate nonlinear regression models, we trained and tuned four models on the ChemicalManufacturingProcess dataset: K-Nearest Neighbors (KNN), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM) with a radial kernel, and a Neural Network using model averaging.

We assessed model performance using Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE) on a held-out test set. The results are summarized below:

model_comparison <- as.data.frame(rbind(
  "MARS" = postResample(pred = mars_pred, obs = test$Yield),
  "SVM" = postResample(pred = svm_pred, obs = test$Yield),
  "NeuralNet" = postResample(pred = nnet_pred, obs = test$Yield),
  "KNN" = postResample(pred = knn_pred, obs = test$Yield)
)) %>%
  tibble::rownames_to_column("Model") %>%
  mutate(across(c(RMSE, Rsquared, MAE), ~ round(.x, 3))) %>%
  arrange(RMSE)

knitr::kable(model_comparison, caption = "Performance Comparison of Nonlinear Models")

Performance Comparison of Nonlinear Models
Model	RMSE	Rsquared	MAE
MARS	1.078	0.760	0.861
SVM	1.157	0.787	0.762
NeuralNet	1.176	0.733	0.847
KNN	1.584	0.556	1.216

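MARS has the lowest test-set RMSE (1.078), although SVM has a higher R-squared (0.787 versus 0.760) and a lower MAE (0.762 versus 0.861), so the two models are close. Using RMSE as the primary criterion, MARS is considered the optimal nonlinear model for this dataset on the test set.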
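Because the ranking depends on the metric, it can help to pick the best model per metric explicitly. The sketch below re-enters the test-set values from the table above by hand; the object names `comparison_tbl` and `best_by_metric` are illustrative.

```r
# Best model per metric, using the test-set values from the table above
comparison_tbl <- data.frame(
  Model = c("MARS", "SVM", "NeuralNet", "KNN"),
  RMSE = c(1.078, 1.157, 1.176, 1.584),
  Rsquared = c(0.760, 0.787, 0.733, 0.556),
  MAE = c(0.861, 0.762, 0.847, 1.216),
  stringsAsFactors = FALSE
)

best_by_metric <- c(
  RMSE = comparison_tbl$Model[which.min(comparison_tbl$RMSE)],         # lower is better
  Rsquared = comparison_tbl$Model[which.max(comparison_tbl$Rsquared)], # higher is better
  MAE = comparison_tbl$Model[which.min(comparison_tbl$MAE)]            # lower is better
)
best_by_metric
```

MARS wins on RMSE while SVM wins on R-squared and MAE, so the choice of MARS rests on treating RMSE as the deciding metric.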

(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

mars_importance <- varImp(mars_model)
print(mars_importance)

## earth variable importance
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   55.38
## ManufacturingProcess33   35.13
## ManufacturingProcess17   35.13
## ManufacturingProcess06   31.75
## ManufacturingProcess04   25.18
## ManufacturingProcess44   18.04
## ManufacturingProcess28   11.34
## ManufacturingProcess39   11.34
## BiologicalMaterial05      0.00

The MARS importance listing contains 10 predictors, with ManufacturingProcess32 being the most influential (100.00), followed by ManufacturingProcess09 (55.38), then ManufacturingProcess33 and ManufacturingProcess17 (tied at 35.13). Nine of the ten are manufacturing process variables; the only biological variable, BiologicalMaterial05, has an importance of zero. This suggests that process variables played a much bigger role in predicting yield than biological variables.
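The split between process and biological predictors can be tallied from the variable names. This sketch copies the ten names from the importance output above into a vector; `listed_vars` and `var_type` are illustrative names.

```r
# Tally process vs. biological predictors among the ten listed variables
listed_vars <- c(
  "ManufacturingProcess32", "ManufacturingProcess09", "ManufacturingProcess33",
  "ManufacturingProcess17", "ManufacturingProcess06", "ManufacturingProcess04",
  "ManufacturingProcess44", "ManufacturingProcess28", "ManufacturingProcess39",
  "BiologicalMaterial05"
)
var_type <- ifelse(grepl("^ManufacturingProcess", listed_vars),
                   "Process", "Biological")
type_counts <- table(var_type)
type_counts   # Biological: 1, Process: 9
```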

Compared to the optimal linear model (from Exercise 6.3), the top predictors are similar in that ManufacturingProcess32 and ManufacturingProcess09 were also important there, confirming their strong relationship with yield. However, MARS can also capture nonlinear effects through its hinge functions, which a linear model cannot, and so may give better performance overall.

(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

We first extract the variable importance from the MARS model and keep the ten highest-ranked predictors:

mars_importance <- varImp(mars_model)$importance

library(dplyr)
top_vars <- mars_importance %>%
  tibble::rownames_to_column(var = "Predictor") %>%
  arrange(desc(Overall)) %>%
  slice(1:10) %>%
  pull(Predictor)

featurePlot(x = train[, top_vars],
            y = train$Yield,
            plot = "scatter",
            layout = c(5, 2))

Using varImp() and featurePlot(), we visualized how the top 10 predictors from the optimal nonlinear model (MARS) relate to Yield. The scatterplots show clear non-linear patterns for several manufacturing process variables, especially ManufacturingProcess32, 09, and 17.

These plots suggest that process variables have a stronger influence on yield than biological ones, which aligns with the variable importance results. Compared to the linear model from Exercise 6.3, MARS identifies more complex relationships, supporting its effectiveness for modeling this chemical process.