library(mlbench)
library(tidyverse)
library(caret)
library(kernlab)
library(earth)
library(gridExtra)
Question 1: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\[y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + N(0, \sigma^2)\]
where the ‘x’ values are random variables uniformly distributed on [0, 1] (the simulation also creates five additional non-informative variables). The package ‘mlbench’ contains a function called ‘mlbench.friedman1’ that simulates these data. Which models appear to give the best performance?
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Let’s create a data frame that will store the model metrics ‘RMSE’, ‘Rsquared’, and ‘MAE’.
model_metrics <- data.frame(
matrix(vector(), 0, 3,
dimnames = list(c(), c("RMSE", "Rsquared", "MAE"))),
stringsAsFactors = FALSE
)
(a) Tune several models on these data. For example:
K-nearest Neighbors Model:
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
Prediction Using the KNN Model:
knnPred <- predict(knnModel, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = knnPred, obs = testData$y))
Now that the KNN model has been trained and its performance metrics have been recorded in the model_metrics data frame, we can proceed to train and evaluate additional models using the same approach.
Multivariate Adaptive Regression Splines Model:
In this section, we will tune a Multivariate Adaptive Regression Splines (MARS) model. We’ll start by defining a tuneGrid to specify the combinations of model parameters to evaluate:
mars_tune_grid <- expand.grid(degree = 1:2, nprune = 2:38)
Now that we’ve defined our tuning grid, we can proceed to train the MARS model using the train() function from the caret package, setting the method to “earth”.
mars_model <- train(
x = trainingData$x,
y = trainingData$y,
method = "earth",
preProcess = c("center", "scale"),
tuneGrid = mars_tune_grid,
trControl = trainControl(method = "cv")
)
mars_model
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.462296 0.2176253 3.697979
## 1 3 3.720663 0.4673821 2.949121
## 1 4 2.680039 0.7094916 2.123848
## 1 5 2.333538 0.7781559 1.856629
## 1 6 2.367933 0.7754329 1.901509
## 1 7 1.809983 0.8656526 1.414985
## 1 8 1.692656 0.8838936 1.333678
## 1 9 1.704958 0.8845683 1.339517
## 1 10 1.688559 0.8842495 1.309838
## 1 11 1.669043 0.8886165 1.296522
## 1 12 1.645066 0.8892796 1.271981
## 1 13 1.655570 0.8886896 1.271232
## 1 14 1.666354 0.8879143 1.285545
## 1 15 1.666354 0.8879143 1.285545
## 1 16 1.666354 0.8879143 1.285545
## 1 17 1.666354 0.8879143 1.285545
## 1 18 1.666354 0.8879143 1.285545
## 1 19 1.666354 0.8879143 1.285545
## 1 20 1.666354 0.8879143 1.285545
## 1 21 1.666354 0.8879143 1.285545
## 1 22 1.666354 0.8879143 1.285545
## 1 23 1.666354 0.8879143 1.285545
## 1 24 1.666354 0.8879143 1.285545
## 1 25 1.666354 0.8879143 1.285545
## 1 26 1.666354 0.8879143 1.285545
## 1 27 1.666354 0.8879143 1.285545
## 1 28 1.666354 0.8879143 1.285545
## 1 29 1.666354 0.8879143 1.285545
## 1 30 1.666354 0.8879143 1.285545
## 1 31 1.666354 0.8879143 1.285545
## 1 32 1.666354 0.8879143 1.285545
## 1 33 1.666354 0.8879143 1.285545
## 1 34 1.666354 0.8879143 1.285545
## 1 35 1.666354 0.8879143 1.285545
## 1 36 1.666354 0.8879143 1.285545
## 1 37 1.666354 0.8879143 1.285545
## 1 38 1.666354 0.8879143 1.285545
## 2 2 4.440854 0.2204755 3.686796
## 2 3 3.697203 0.4714312 2.938566
## 2 4 2.664266 0.7149235 2.119566
## 2 5 2.313371 0.7837374 1.852172
## 2 6 2.335796 0.7875253 1.841919
## 2 7 1.833248 0.8623489 1.461538
## 2 8 1.695822 0.8883658 1.329030
## 2 9 1.555106 0.9028532 1.221365
## 2 10 1.497805 0.9088251 1.158054
## 2 11 1.419280 0.9207646 1.139722
## 2 12 1.326566 0.9315939 1.066200
## 2 13 1.266877 0.9354482 1.002983
## 2 14 1.256694 0.9349307 1.006273
## 2 15 1.311401 0.9316487 1.039213
## 2 16 1.292299 0.9336915 1.022410
## 2 17 1.304364 0.9321032 1.031643
## 2 18 1.304364 0.9321032 1.031643
## 2 19 1.304364 0.9321032 1.031643
## 2 20 1.304364 0.9321032 1.031643
## 2 21 1.304364 0.9321032 1.031643
## 2 22 1.304364 0.9321032 1.031643
## 2 23 1.304364 0.9321032 1.031643
## 2 24 1.304364 0.9321032 1.031643
## 2 25 1.304364 0.9321032 1.031643
## 2 26 1.304364 0.9321032 1.031643
## 2 27 1.304364 0.9321032 1.031643
## 2 28 1.304364 0.9321032 1.031643
## 2 29 1.304364 0.9321032 1.031643
## 2 30 1.304364 0.9321032 1.031643
## 2 31 1.304364 0.9321032 1.031643
## 2 32 1.304364 0.9321032 1.031643
## 2 33 1.304364 0.9321032 1.031643
## 2 34 1.304364 0.9321032 1.031643
## 2 35 1.304364 0.9321032 1.031643
## 2 36 1.304364 0.9321032 1.031643
## 2 37 1.304364 0.9321032 1.031643
## 2 38 1.304364 0.9321032 1.031643
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
We can see that beyond nprune = 14 with degree = 2, RMSE, MAE, and \(R^2\) show little to no improvement. So we can now move forward and make predictions using the MARS model.
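To see this flattening directly, the resampling profile stored in the caret object can be plotted; a quick optional sketch:
# Cross-validated RMSE against the number of retained terms, one curve per degree;
# the curves level off once additional terms stop improving the fit.
ggplot(mars_model)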
Prediction Using the MARS Model
Now that we’ve trained and tuned the MARS model, we’ll use it to make predictions on the test data and evaluate its performance.
mars_pred <- predict(mars_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = mars_pred, obs = testData$y))
Now that our MARS model is complete and its performance metrics have been added to the model_metrics data frame, we can proceed to train a Support Vector Machine model.
Support Vector Machine (SVM) Model:
svm_model <- train(
x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv"),
tuneLength = 10
)
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.475865 0.7988016 1.992183
## 0.50 2.214599 0.8166629 1.774775
## 1.00 2.046869 0.8398392 1.623489
## 2.00 1.953012 0.8519284 1.552263
## 4.00 1.891812 0.8587464 1.523110
## 8.00 1.875668 0.8604860 1.532309
## 16.00 1.879129 0.8595239 1.542180
## 32.00 1.879024 0.8595396 1.542161
## 64.00 1.879024 0.8595396 1.542161
## 128.00 1.879024 0.8595396 1.542161
##
## Tuning parameter 'sigma' was held constant at a value of 0.06437208
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06437208 and C = 8.
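As an optional check, the underlying kernlab fit can be inspected through the finalModel component of the caret object; printing it reports the kernel parameters along with the number of support vectors and the training error. A small sketch:
svm_model$finalModel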
Prediction Using the Support Vector Machines Model:
svm_Pred <- predict(svm_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = svm_Pred, obs = testData$y))
Neural Networks Model:
In this section we will define the tuning grid for the averaged neural network model.
nnet_tune_grid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1), bag = FALSE)
The code below sets an upper bound on the number of weights (MaxNWts) allowed by the underlying nnet fit, so that the larger networks in the grid are not rejected for having too many weights.
nnet_maxnwts <- 10 * (ncol(trainingData$x) + 1) + 5 + 1
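For context, a single-hidden-layer network with H hidden units, p inputs, and one linear output estimates H(p + 1) + H + 1 weights, so the largest network in our grid (H = 5, p = 10) needs 61 weights and the cap above is comfortably sufficient. A quick check:
# Weights required by the largest network in the tuning grid
max(nnet_tune_grid$size) * (ncol(trainingData$x) + 1) + max(nnet_tune_grid$size) + 1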
Now that we’ve defined our tuning grid and calculated MaxNWts, we can proceed to train the neural network model using the train() function from the caret package, setting the method to “avNNet”.
nnet_model <- train(
x = trainingData$x,
y = trainingData$y,
method = "avNNet",
preProcess = c("center", "scale"),
tuneGrid = nnet_tune_grid,
trControl = trainControl(method = "cv"),
linout = TRUE,
trace = FALSE,
MaxNWts = nnet_maxnwts,
maxit = 500
)
nnet_model
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0.00 2.501962 0.7484271 1.974938
## 1 0.01 2.439901 0.7566477 1.904959
## 1 0.10 2.444509 0.7553394 1.906025
## 2 0.00 2.438279 0.7534711 1.916776
## 2 0.01 2.477303 0.7517684 1.948808
## 2 0.10 2.509162 0.7434646 1.952322
## 3 0.00 2.036144 0.8297174 1.599508
## 3 0.01 2.144052 0.8187443 1.711054
## 3 0.10 2.169155 0.8134280 1.724083
## 4 0.00 1.923496 0.8501136 1.558840
## 4 0.01 2.002755 0.8427114 1.574649
## 4 0.10 2.127435 0.8242030 1.699589
## 5 0.00 2.279592 0.7874826 1.736309
## 5 0.01 2.134616 0.8147876 1.706574
## 5 0.10 2.257318 0.7994550 1.799377
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
Prediction Using the Neural Networks Model
Now that our neural network model is trained, we’ll use it to make predictions on the test data and add its performance metrics to the tracking data frame for comparison with the other models.
nnet_Pred <- predict(nnet_model, newdata = testData$x)
model_metrics <- rbind(model_metrics, postResample(pred = nnet_Pred, obs = testData$y))
colnames(model_metrics) <- c("RMSE","Rsquared","MAE")
model_metrics$NAME <- c("KNN", "MARS", "SVM", "Neural Network") # same order in which the rows were added above
model_metrics <- model_metrics %>% relocate("NAME") %>% arrange(RMSE)
knitr::kable(model_metrics, caption = "Performance Comparison of Models")
| NAME | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.277999 | 0.9338365 | 1.014707 |
| Neural Network | 1.961986 | 0.8481771 | 1.492773 |
| SVM | 2.059968 | 0.8280925 | 1.563584 |
| KNN | 3.204060 | 0.6819919 | 2.568346 |
From the table above, we can see that the MARS model had the best performance.
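A quick bar chart of the test-set RMSE values makes the gap between the models visually apparent; a minimal sketch built from the model_metrics data frame:
ggplot(model_metrics, aes(x = reorder(NAME, RMSE), y = RMSE)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Test set RMSE")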
(b) Does MARS select the informative predictors (those named X1–X5)?
mars_vi <- varImp(mars_model)
mars_vi
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
Yes, the MARS model selected the informative predictors, which aligns with the structure of the simulated data. X1 was the most influential, followed by X4 and X2, while X5 contributed less. X3 is the least important predictor retained; its score shows as 0 only because varImp() rescales the importances to a 0–100 range, so the lowest retained predictor is mapped to zero. The remaining variables (X6–X10) were not selected at all, confirming that MARS effectively excluded the non-informative predictors as expected.
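The same scores can be plotted directly from the varImp object; a small sketch:
# Dotplot of the scaled importance scores for the retained predictors
plot(mars_vi, top = 5)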
Question 2: Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
library(AppliedPredictiveModeling)
data("ChemicalManufacturingProcess")
str(ChemicalManufacturingProcess)
## 'data.frame': 176 obs. of 58 variables:
## $ Yield : num 38 42.4 42 41.4 42.5 ...
## $ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
## $ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
## $ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
## $ BiologicalMaterial04 : num 12.7 14.6 14.6 14.6 14 ...
## $ BiologicalMaterial05 : num 19.5 19.4 19.4 19.4 17.9 ...
## $ BiologicalMaterial06 : num 43.7 53.1 53.1 53.1 54.7 ...
## $ BiologicalMaterial07 : num 100 100 100 100 100 100 100 100 100 100 ...
## $ BiologicalMaterial08 : num 16.7 19 19 19 18.2 ...
## $ BiologicalMaterial09 : num 11.4 12.6 12.6 12.6 12.8 ...
## $ BiologicalMaterial10 : num 3.46 3.46 3.46 3.46 3.05 3.78 3.04 3.85 3.85 3.85 ...
## $ BiologicalMaterial11 : num 138 154 154 154 148 ...
## $ BiologicalMaterial12 : num 18.8 21.1 21.1 21.1 21.1 ...
## $ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
## $ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
## $ ManufacturingProcess04: num NA 917 912 911 918 924 933 929 928 938 ...
## $ ManufacturingProcess05: num NA 1032 1004 1015 1028 ...
## $ ManufacturingProcess06: num NA 210 207 213 206 ...
## $ ManufacturingProcess07: num NA 177 178 177 178 178 177 178 177 177 ...
## $ ManufacturingProcess08: num NA 178 178 177 178 178 178 178 177 177 ...
## $ ManufacturingProcess09: num 43 46.6 45.1 44.9 45 ...
## $ ManufacturingProcess10: num NA NA NA NA NA NA 11.6 10.2 9.7 10.1 ...
## $ ManufacturingProcess11: num NA NA NA NA NA NA 11.5 11.3 11.1 10.2 ...
## $ ManufacturingProcess12: num NA 0 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess13: num 35.5 34 34.8 34.8 34.6 34 32.4 33.6 33.9 34.3 ...
## $ ManufacturingProcess14: num 4898 4869 4878 4897 4992 ...
## $ ManufacturingProcess15: num 6108 6095 6087 6102 6233 ...
## $ ManufacturingProcess16: num 4682 4617 4617 4635 4733 ...
## $ ManufacturingProcess17: num 35.5 34 34.8 34.8 33.9 33.4 33.8 33.6 33.9 35.3 ...
## $ ManufacturingProcess18: num 4865 4867 4877 4872 4886 ...
## $ ManufacturingProcess19: num 6049 6097 6078 6073 6102 ...
## $ ManufacturingProcess20: num 4665 4621 4621 4611 4659 ...
## $ ManufacturingProcess21: num 0 0 0 0 -0.7 -0.6 1.4 0 0 1 ...
## $ ManufacturingProcess22: num NA 3 4 5 8 9 1 2 3 4 ...
## $ ManufacturingProcess23: num NA 0 1 2 4 1 1 2 3 1 ...
## $ ManufacturingProcess24: num NA 3 4 5 18 1 1 2 3 4 ...
## $ ManufacturingProcess25: num 4873 4869 4897 4892 4930 ...
## $ ManufacturingProcess26: num 6074 6107 6116 6111 6151 ...
## $ ManufacturingProcess27: num 4685 4630 4637 4630 4684 ...
## $ ManufacturingProcess28: num 10.7 11.2 11.1 11.1 11.3 11.4 11.2 11.1 11.3 11.4 ...
## $ ManufacturingProcess29: num 21 21.4 21.3 21.3 21.6 21.7 21.2 21.2 21.5 21.7 ...
## $ ManufacturingProcess30: num 9.9 9.9 9.4 9.4 9 10.1 11.2 10.9 10.5 9.8 ...
## $ ManufacturingProcess31: num 69.1 68.7 69.3 69.3 69.4 68.2 67.6 67.9 68 68.5 ...
## $ ManufacturingProcess32: num 156 169 173 171 171 173 159 161 160 164 ...
## $ ManufacturingProcess33: num 66 66 66 68 70 70 65 65 65 66 ...
## $ ManufacturingProcess34: num 2.4 2.6 2.6 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
## $ ManufacturingProcess35: num 486 508 509 496 468 490 475 478 491 488 ...
## $ ManufacturingProcess36: num 0.019 0.019 0.018 0.018 0.017 0.018 0.019 0.019 0.019 0.019 ...
## $ ManufacturingProcess37: num 0.5 2 0.7 1.2 0.2 0.4 0.8 1 1.2 1.8 ...
## $ ManufacturingProcess38: num 3 2 2 2 2 2 2 2 3 3 ...
## $ ManufacturingProcess39: num 7.2 7.2 7.2 7.2 7.3 7.2 7.3 7.3 7.4 7.1 ...
## $ ManufacturingProcess40: num NA 0.1 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess41: num NA 0.15 0 0 0 0 0 0 0 0 ...
## $ ManufacturingProcess42: num 11.6 11.1 12 10.6 11 11.5 11.7 11.4 11.4 11.3 ...
## $ ManufacturingProcess43: num 3 0.9 1 1.1 1.1 2.2 0.7 0.8 0.9 0.8 ...
## $ ManufacturingProcess44: num 1.8 1.9 1.8 1.8 1.7 1.8 2 2 1.9 1.9 ...
## $ ManufacturingProcess45: num 2.4 2.2 2.3 2.1 2.1 2 2.2 2.2 2.1 2.4 ...
prep_obj <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale")) # Preprocessing
processed_data <- predict(prep_obj, ChemicalManufacturingProcess)
processed_data$Yield <- ChemicalManufacturingProcess$Yield # Add back the target variable (not transformed or imputed)
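# Optional sanity check (sketch): knnImpute should have filled every missing
# predictor value, so no NAs should remain in the processed data
sum(is.na(processed_data))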
set.seed(123)
ind <- sample(seq_len(nrow(processed_data)), size = floor(0.85 * nrow(processed_data))) # Split the data into training (85%) and testing (15%)
train <- processed_data[ind, ]
test <- processed_data[-ind, ]
# Set up resampling control
control <- trainControl(method = "cv")
# Create an empty results tracker
model_metrics <- data.frame(NAME = character(),
RMSE = numeric(),
Rsquared = numeric(),
MAE = numeric(),
stringsAsFactors = FALSE)
# 1. KNN
knn_model <- train(Yield ~ ., data = train,
method = "knn",
preProcess = c("center", "scale"),
trControl = control,
tuneLength = 10)
knn_pred <- predict(knn_model, newdata = test)
knn_perf <- postResample(pred = knn_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
c(NAME = "KNN", knn_perf))
# 2. MARS
mars_grid <- expand.grid(degree = 1:2, nprune = 2:20)
mars_model <- train(Yield ~ ., data = train,
method = "earth",
preProcess = c("center", "scale"),
tuneGrid = mars_grid,
trControl = control)
mars_pred <- predict(mars_model, newdata = test)
mars_perf <- postResample(pred = mars_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
c(NAME = "MARS", mars_perf))
# 3. SVM
svm_model <- train(Yield ~ ., data = train,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = control,
tuneLength = 10)
svm_pred <- predict(svm_model, newdata = test)
svm_perf <- postResample(pred = svm_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
c(NAME = "SVM", svm_perf))
# 4. Neural Network
nnet_grid <- expand.grid(size = 1:5, decay = c(0, 0.01, 0.1),bag = FALSE)
nnet_maxnwts <- 5 * (ncol(train) - 1 + 1) + 5 + 1 # size*(p+1) + size + 1 for the largest grid size (5), with p = ncol(train) - 1 predictors
nnet_model <- train(Yield ~ ., data = train,
method = "avNNet",
preProcess = c("center", "scale"),
tuneGrid = nnet_grid,
trControl = control,
linout = TRUE,
trace = FALSE,
MaxNWts = nnet_maxnwts,
maxit = 500)
nnet_pred <- predict(nnet_model, newdata = test)
nnet_perf <- postResample(pred = nnet_pred, obs = test$Yield)
model_metrics <- rbind(model_metrics,
c(NAME = "Neural Network", nnet_perf))
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
To evaluate nonlinear regression models, we trained and tuned four models on the ChemicalManufacturingProcess dataset: K-Nearest Neighbors (KNN), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM) with a radial kernel, and a Neural Network using model averaging.
We assessed model performance using Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE) on a held-out test set. The results are summarized below:
model_comparison <- as.data.frame(rbind(
"MARS" = postResample(pred = mars_pred, obs = test$Yield),
"SVM" = postResample(pred = svm_pred, obs = test$Yield),
"NeuralNet" = postResample(pred = nnet_pred, obs = test$Yield),
"KNN" = postResample(pred = knn_pred, obs = test$Yield)
)) %>%
tibble::rownames_to_column("Model") %>%
mutate(across(c(RMSE, Rsquared, MAE), ~ round(.x, 3))) %>%
arrange(RMSE)
knitr::kable(model_comparison, caption = "Performance Comparison of Nonlinear Models")
| Model | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.078 | 0.760 | 0.861 |
| SVM | 1.157 | 0.787 | 0.762 |
| NeuralNet | 1.176 | 0.733 | 0.847 |
| KNN | 1.584 | 0.556 | 1.216 |
Based on the lowest RMSE and competitive R-squared, the MARS model performed best overall on the test set. Therefore, MARS is considered the optimal nonlinear model for this dataset.
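Since part (a) also asks about resampling performance, the cross-validation results stored in each fitted caret model can be collected and summarized with resamples(); for a strictly paired comparison the same CV folds would ideally be shared across models, but this still gives a useful overview. A minimal sketch:
# Collect the 10-fold CV results from the four fitted models
cv_results <- resamples(list(KNN = knn_model, MARS = mars_model,
                             SVM = svm_model, NNet = nnet_model))
summary(cv_results)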
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
mars_importance <- varImp(mars_model)
print(mars_importance)
## earth variable importance
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 55.38
## ManufacturingProcess33 35.13
## ManufacturingProcess17 35.13
## ManufacturingProcess06 31.75
## ManufacturingProcess04 25.18
## ManufacturingProcess44 18.04
## ManufacturingProcess28 11.34
## ManufacturingProcess39 11.34
## BiologicalMaterial05 0.00
The MARS model’s variable importance listing is dominated by manufacturing process variables, with ManufacturingProcess32 the most influential, followed by ManufacturingProcess09, ManufacturingProcess33, and ManufacturingProcess17. The only biological variable in the top ten, BiologicalMaterial05, sits at the bottom with a scaled importance of zero. This suggests that process variables play a much bigger role in predicting yield than biological variables.
Compared to the optimal linear model from Exercise 6.3, the top predictors are similar: ManufacturingProcess32 and ManufacturingProcess09 were also important there, confirming their strong relationship with yield. The nonlinear model, however, can capture interactions and curvature that the linear model cannot, which may explain its better performance overall.
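For a side-by-side check, one could refit the linear model from Exercise 6.3 on the same training split and compare its top ten predictors with those of MARS. The sketch below assumes that model was an elastic net (caret method "glmnet"); substitute whichever method was actually optimal in 6.3.
# Sketch only: the choice of "glmnet" is an assumption, not taken from Exercise 6.3
set.seed(123)
linear_model <- train(Yield ~ ., data = train,
                      method = "glmnet",
                      preProcess = c("center", "scale"),
                      trControl = control,
                      tuneLength = 10)
varImp(linear_model) # compare its top ten against varImp(mars_model)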
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
We extract the variable importance scores from the MARS model:
mars_importance <- varImp(mars_model)$importance
library(dplyr)
top_vars <- mars_importance %>%
tibble::rownames_to_column(var = "Predictor") %>%
arrange(desc(Overall)) %>%
slice(1:10) %>%
pull(Predictor)
featurePlot(x = train[, top_vars],
y = train$Yield,
plot = "scatter",
layout = c(5, 2))
Using varImp() and featurePlot(), we visualized how the top 10 predictors from the optimal nonlinear model (MARS) relate to Yield. The scatterplots show clear non-linear patterns for several manufacturing process variables, especially ManufacturingProcess32, 09, and 17.
These plots suggest that process variables have a stronger influence on yield than biological ones, which aligns with the variable importance results. Compared to the linear model from Exercise 6.3, MARS identifies more complex relationships, supporting its effectiveness for modeling this chemical process.
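As a simple numeric complement to the scatterplots, we could also look at the linear correlation of each top predictor with Yield (bearing in mind that these correlations will understate any purely non-linear relationships MARS is exploiting); a small sketch:
round(cor(train[, top_vars], train$Yield), 2)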