Introduction

The following R Markdown document contains CUNY SPS DATA 624 Spring 2025 answers to exercises from Chapter 7 of Kuhn and Johnson's Applied Predictive Modeling (http://appliedpredictivemodeling.com/), which covers nonlinear regression models. Exercises 7.2 and 7.5 are answered here.

Load Libraries

library(tidyverse)
library(caret)
library(earth)
library(kernlab)
library(nnet)
library(knitr)
library(kableExtra)

7.2

Prompt: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: [see textbook for formula] where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
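(For reference, the simulated response is y = 10 sin(pi * x1 * x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e, with e ~ N(0, sigma^2); x6 through x10 are the five non-informative predictors.)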

# Provided code

library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)

## or other methods.

## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# Tune several models on these data. For example:
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)

# knnModel

# Predict on the test set
knnPred <- predict(knnModel, newdata = testData$x)

# Get the test set performance values
knnPerformanceFriedman <- postResample(pred = knnPred, obs = testData$y)
knnPerformanceFriedman
##      RMSE  Rsquared       MAE 
## 3.2040595 0.6819919 2.5683461

Tune several models on these data

Multivariate Adaptive Regression Splines (MARS)

# Train first model MARS
marsFitFriedman <- earth(trainingData$x, trainingData$y)
marsFitFriedman
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
# See summary to get prune values
summary(marsFitFriedman)
## Call: earth(x=trainingData$x, y=trainingData$y)
## 
##                coefficients
## (Intercept)       18.451984
## h(0.621722-X1)   -11.074396
## h(0.601063-X2)   -10.744225
## h(X3-0.281766)    20.607853
## h(0.447442-X3)    17.880232
## h(X3-0.447442)   -23.282007
## h(X3-0.636458)    15.150350
## h(0.734892-X4)   -10.027487
## h(X4-0.734892)     9.092045
## h(0.850094-X5)    -4.723407
## h(X5-0.850094)    10.832932
## h(X6-0.361791)    -1.956821
## 
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556    RSS 397.9654    GRSq 0.8968524    RSq 0.9183982
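Each h(.) term above is a hinge function, e.g. h(0.621722 - X1) = max(0, 0.621722 - X1), so the model is a sum of piecewise-linear segments. If desired, the fitted per-predictor effects can be visualized with plotmo (a sketch; assumes the plotmo package, a suggested dependency of earth, is installed):

# Plot the modeled effect of each selected predictor on the response
library(plotmo)
plotmo(marsFitFriedman)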
# Tune first model
# Define the candidate models to test
marsGridFriedman <- expand.grid(.degree = 1:2, .nprune = 2:12)

# Fix the seed so that the results can be reproduced 
set.seed(64)

# Train the MARS model
marsModelFriedman <- train(x = trainingData$x,
                           y = trainingData$y,
                           method = "earth",
                           # Explicitly declare the candidate models to test
                           tuneGrid = marsGridFriedman,
                           trControl = trainControl(method = "cv"))

# marsModelFriedman
  # > The final values used for the model were nprune = 12 and degree = 2
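  # Optionally inspect the resampling profile across the tuning grid:
  # plot(marsModelFriedman)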

# Predict on the test set
marsPredFriedman <- predict(marsModelFriedman, newdata = testData$x)

# Get the test set performance values
marsPerformanceFriedman <- postResample(pred = marsPredFriedman, obs = testData$y)
marsPerformanceFriedman
##      RMSE  Rsquared       MAE 
## 1.2803060 0.9335241 1.0168673

Support Vector Machines (SVMs)

# Tune second model
svmModelFriedman <- train(x = trainingData$x,
                          y = trainingData$y,
                          method = "svmRadial",
                          preProc = c("center", "scale"),
                          tuneLength = 14,
                          trControl = trainControl(method = "cv"))

# svmModelFriedman

# Predict on the test set
svmPredFriedman <- predict(svmModelFriedman, newdata = testData$x)

# Get the test set performance values
svmPerformanceFriedman <- postResample(pred = svmPredFriedman, obs = testData$y)
svmPerformanceFriedman
##      RMSE  Rsquared       MAE 
## 2.0748819 0.8254684 1.5760551

Neural Networks (NN)

# Tune third model
nnetGridFriedman <- expand.grid(.decay = c(0, 0.01, .1),
                                .size = c(1:10),
                                # .bag = FALSE: average networks fit from
                                # different random seeds rather than bagging
                                .bag = FALSE)


set.seed(64)

# Train the neural network model
nnetModelFriedman <-  train(x = trainingData$x, 
                          y = trainingData$y, 
                          method = "avNNet",
                          tuneGrid = nnetGridFriedman,
                          trControl = trainControl(method = "cv", number = 10),
                          ## Automatically standardize data prior to modeling and prediction
                          preProc = c("center", "scale"),
                          linout = TRUE,
                          trace = FALSE,
                          MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
                          maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
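# The %dopar% warning above just means no parallel backend is registered, so
# the resampling loops run sequentially. A minimal sketch to register one
# (assumes the doParallel package is installed; worker count is illustrative):
#   library(doParallel)
#   cl <- makePSOCKcluster(4)
#   registerDoParallel(cl)
#   ... train() calls now run their resamples in parallel ...
#   stopCluster(cl)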
# nnetModelFriedman

# Predict on the test set
nnetPredFriedman <- predict(nnetModelFriedman, newdata = testData$x)

# Get the test set performance values
nnetPerformanceFriedman <- postResample(pred = nnetPredFriedman, obs = testData$y)
nnetPerformanceFriedman
##      RMSE  Rsquared       MAE 
## 1.9595108 0.8492313 1.4737210

Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?

The MARS model gives the best performance: it has the lowest RMSE and MAE and the highest R-squared of the four models, as the comparison below shows.
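
The comparison table below was assembled from the four postResample results; a minimal sketch, using the same kable approach as in 7.5:

# Combine the test-set performance vectors into one data frame
friedman_model_performance <- data.frame(
  model = c("KNN", "MARS", "SVM", "NNet"),
  rbind(knnPerformanceFriedman,
        marsPerformanceFriedman,
        svmPerformanceFriedman,
        nnetPerformanceFriedman),
  row.names = NULL
)

# View the result in a kable visual
kable(friedman_model_performance, caption = "Model Performance Comparison", digits = 4) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)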

Model Performance Comparison

model    RMSE  Rsquared     MAE
KNN    3.2041    0.6820  2.5683
MARS   1.2803    0.9335  1.0169
SVM    2.0749    0.8255  1.5761
NNet   1.9595    0.8492  1.4737

MARS does select the informative predictors, which we know are X1 through X5; varImp on the tuned model ranks them X1, X4, X2, X5, X3, and none of the uninformative predictors appear.

varImp(marsModelFriedman)
## earth variable importance
## 
##    Overall
## X1  100.00
## X4   75.33
## X2   48.88
## X5   15.63
## X3    0.00

7.5

Prompt: Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.

Data imputation, data splitting, and pre-processing from 6.3

# Set seed for reproducibility
set.seed(64)

# Provided code
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)

# Predict missing values using KNN imputation function
chem_impute_vals <- preProcess(
    ChemicalManufacturingProcess,
    method = c("knnImpute")
)

# Fill the predicted values
chem_imputed <- predict(
    chem_impute_vals,
    ChemicalManufacturingProcess
)
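
# Note: caret's knnImpute also centers and scales every column (its
# nearest-neighbor distance calculation requires it), so Yield and the
# predictors are on standardized scales from this point forward.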

# Check for any remaining missing values
# sum(is.na(chem_imputed))

# As in exercise 6.2(b), filter out near-zero-variance predictors with nearZeroVar
near_zero_chem <- nearZeroVar(chem_imputed)
filtered_chem <- chem_imputed[, -near_zero_chem]

# Create partition of 80-20
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)

# Split X into training and testing (exclude 'Yield' column)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem  <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]

# Split y into training and testing (only the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem  <- filtered_chem[-index_chem, "Yield", drop = TRUE]

# Train the pls model from 6.3
plsModelChem <- train(x = X_train_chem,
                      y = y_train_chem,
                      method = "pls",
                      preProcess = c("center", "scale"),
                      tuneLength = 20,
                      trControl = trainControl(method = "cv", number = 10))

Tune several models on these data

Multivariate Adaptive Regression Splines (MARS)

# Train first model MARS
marsFitChem <- earth(X_train_chem, y_train_chem)
marsFitChem
## Selected 18 of 22 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.3000591    RSS 24.75696    GRSq 0.7099421    RSq 0.8314745
# See summary to get prune values
summary(marsFitChem)
## Call: earth(x=X_train_chem, y=y_train_chem)
## 
##                                     coefficients
## (Intercept)                           -1.2531287
## h(1.08846-BiologicalMaterial03)       -0.2840134
## h(BiologicalMaterial03-1.08846)       -0.8726237
## h(-1.04109-BiologicalMaterial08)      -0.8041139
## h(BiologicalMaterial08- -1.04109)     -0.2569086
## h(-0.531729-BiologicalMaterial11)      0.7057958
## h(BiologicalMaterial11- -0.531729)     0.3607210
## h(-1.23517-ManufacturingProcess09)    -0.6034754
## h(ManufacturingProcess09- -1.23517)    0.2631493
## h(-1.38676-ManufacturingProcess13)     1.7665158
## h(0.763612-ManufacturingProcess28)     0.2537783
## h(ManufacturingProcess28-0.763612)     2.9373045
## h(ManufacturingProcess32- -1.198)      1.0646118
## h(ManufacturingProcess32-1.39591)     -1.1080155
## h(-1.02435-ManufacturingProcess33)     0.8034576
## h(ManufacturingProcess33- -1.02435)   -0.3426442
## h(0.129718-ManufacturingProcess35)     0.1851258
## h(0.0988802-ManufacturingProcess39)   -0.2572506
## 
## Selected 18 of 22 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.3000591    RSS 24.75696    GRSq 0.7099421    RSq 0.8314745
# Tune first model
# Define the candidate models to test
marsGridChem <- expand.grid(.degree = 1:2, .nprune = 2:23)

# Fix the seed so that the results can be reproduced 
set.seed(64)

# Train the MARS model
marsModelChem <- train(x = X_train_chem, 
                       y = y_train_chem,
                       method = "earth",
                       # Explicitly declare the candidate models to test 
                       tuneGrid = marsGridChem,
                       trControl = trainControl(method = "cv"))

# marsModelChem
  # > The final values used for the model were nprune = 5 and degree = 1.

# Predict on the test set
marsPredChem <- predict(marsModelChem, newdata = X_test_chem)

# Get the test set performance values
marsPerformanceChem <- postResample(pred = marsPredChem, obs = y_test_chem)
marsPerformanceChem
##      RMSE  Rsquared       MAE 
## 0.6397000 0.5439205 0.5311242

Support Vector Machines (SVMs)

# Tune second model
svmModelChem <- train(x = X_train_chem,
                      y = y_train_chem,
                      method = "svmRadial",
                      preProc = c("center", "scale"),
                      tuneLength = 14,
                      trControl = trainControl(method = "cv"))

# svmModelChem
  # > The final values used for the model were sigma = 0.01424208 and C = 16.

# Predict on the test set
svmPredChem <- predict(svmModelChem, newdata = X_test_chem)

# Get the test set performance values
svmPerformanceChem <- postResample(pred = svmPredChem, obs = y_test_chem)
svmPerformanceChem
##      RMSE  Rsquared       MAE 
## 0.5265988 0.6873471 0.4488771

Neural Networks (NN)

# Tune third model
nnetGridChem <- expand.grid(.decay = c(0, 0.01, .1),
                            .size = c(1:10),
                            # .bag = FALSE: average networks fit from
                            # different random seeds rather than bagging
                            .bag = FALSE)


set.seed(64)

# Train the neural network model
nnetModelChem <-  train(x = X_train_chem,
                        y = y_train_chem,
                        method = "avNNet",
                        tuneGrid = nnetGridChem,
                        trControl = trainControl(method = "cv", number = 10),
                        ## Automatically standardize data prior to modeling and prediction
                        preProc = c("center", "scale"),
                        linout = TRUE,
                        trace = FALSE,
                        ## Allow enough weights for the largest network on
                        ## this wider predictor set
                        MaxNWts = 10 * (ncol(X_train_chem) + 1) + 10 + 1,
                        maxit = 500)

# nnetModelChem
  # > The final values used for the model were size = 2, decay = 0.1 and bag = FALSE.

# Predict on the test set
nnetPredChem <- predict(nnetModelChem, newdata = X_test_chem)

# Get the test set performance values
nnetPerformanceChem <- postResample(pred = nnetPredChem, obs = y_test_chem)
nnetPerformanceChem
##      RMSE  Rsquared       MAE 
## 0.5276199 0.6851127 0.4391207

K-Nearest Neighbors (KNN)

# Tune several models on these data. For example:
knnModelChem <- train(x = X_train_chem,
                      y = y_train_chem,
                      method = "knn",
                      preProc = c("center", "scale"),
                      tuneLength = 10)

# knnModelChem
  # > The final value used for the model was k = 11.

# Predict on the test set
knnPredChem <- predict(knnModelChem, newdata = X_test_chem)

# Get the test set performance values
knnPerformanceChem <- postResample(pred = knnPredChem, obs = y_test_chem)
knnPerformanceChem
##      RMSE  Rsquared       MAE 
## 0.6375514 0.5544677 0.5344060

a. Which nonlinear regression model gives the optimal resampling and test set performance?

The SVM model gives the best test set performance: it has the lowest RMSE, the highest R-squared, and the second-lowest MAE. The neural network model is a close second, with scores very near the SVM model's.
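
Resampling performance can be checked directly as well; a minimal sketch using caret::resamples (it assumes the compared models share a common resampling scheme, which holds for the three 10-fold CV models here but not for the bootstrap-resampled KNN model):

# Collect and summarize the cross-validation distributions side by side
resamps <- resamples(list(MARS = marsModelChem,
                          SVM  = svmModelChem,
                          NNet = nnetModelChem))
summary(resamps)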

# Create a unified data frame of performance results
chem_model_performance <- data.frame(
  model    = c("KNN", "MARS", "SVM", "NNet"),
  RMSE     = c(knnPerformanceChem["RMSE"],
               marsPerformanceChem["RMSE"],
               svmPerformanceChem["RMSE"],
               nnetPerformanceChem["RMSE"]),
  Rsquared = c(knnPerformanceChem["Rsquared"],
               marsPerformanceChem["Rsquared"],
               svmPerformanceChem["Rsquared"],
               nnetPerformanceChem["Rsquared"]),
  MAE      = c(knnPerformanceChem["MAE"],
               marsPerformanceChem["MAE"],
               svmPerformanceChem["MAE"],
               nnetPerformanceChem["MAE"])
)

# View the result in a kable visual
kable(chem_model_performance, caption = "Model Performance Comparison", digits = 4) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Model Performance Comparison

model    RMSE  Rsquared     MAE
KNN    0.6376    0.5545  0.5344
MARS   0.6397    0.5439  0.5311
SVM    0.5266    0.6873  0.4489
NNet   0.5276    0.6851  0.4391

b. Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?

The optimal nonlinear regression model is the SVM model.

SVM Top Predictors:

  • Top 3 are process variables

  • 6/10 total are process variables

  • Process variables carry more weight, but they do not dominate the list.

Linear Model (PLS) Top Predictors:

  • Top 4 are process variables

  • 7/10 are process variables

  • Process variables carry more weight and come closer to dominating the list than in the SVM model (the quick tally below confirms both counts).
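
A minimal sketch of that tally in base R, using the varImp results printed below:

# Count process vs. biological variables among each model's top 10 predictors
svm_imp <- varImp(svmModelChem)$importance
svm_top10 <- rownames(svm_imp)[order(-svm_imp$Overall)][1:10]
table(ifelse(grepl("^Manufacturing", svm_top10), "process", "biological"))

pls_imp <- varImp(plsModelChem)$importance
pls_top10 <- rownames(pls_imp)[order(-pls_imp$Overall)][1:10]
table(ifelse(grepl("^Manufacturing", pls_top10), "process", "biological"))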

print("SVM model important predictors:")
## [1] "SVM model important predictors:"
varImp(svmModelChem)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   93.63
## ManufacturingProcess09   82.69
## BiologicalMaterial06     81.95
## ManufacturingProcess17   77.98
## BiologicalMaterial03     76.79
## ManufacturingProcess36   74.03
## BiologicalMaterial12     72.56
## ManufacturingProcess31   64.06
## BiologicalMaterial02     63.29
## ManufacturingProcess06   55.97
## ManufacturingProcess11   52.96
## ManufacturingProcess33   49.85
## ManufacturingProcess12   42.18
## BiologicalMaterial11     41.00
## ManufacturingProcess02   39.55
## ManufacturingProcess18   39.55
## ManufacturingProcess29   39.22
## BiologicalMaterial04     38.96
## BiologicalMaterial01     38.29
print("Linear model important predictors:")
## [1] "Linear model important predictors:"
varImp(plsModelChem)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess09   89.64
## ManufacturingProcess36   85.99
## ManufacturingProcess13   83.96
## BiologicalMaterial02     75.58
## BiologicalMaterial06     75.41
## BiologicalMaterial03     74.40
## ManufacturingProcess17   71.80
## ManufacturingProcess33   70.50
## ManufacturingProcess12   64.83
## ManufacturingProcess06   63.69
## ManufacturingProcess11   61.52
## BiologicalMaterial01     58.13
## BiologicalMaterial12     57.39
## BiologicalMaterial08     57.36
## BiologicalMaterial04     54.07
## BiologicalMaterial11     51.69
## ManufacturingProcess20   39.99
## ManufacturingProcess04   37.17
## ManufacturingProcess30   37.00

c. Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?

# Get the most important predictors with varImp
svm_importance <- varImp(svmModelChem, scale = FALSE)$importance |>
  arrange(-Overall)

# Get the top 10 important predictors
top10 <- head(svm_importance, 10)

# Loop through top 10 predictors and create scatter plots
top10_predictors <- rownames(top10)

for (predictor in top10_predictors) {
  p <- ggplot(filtered_chem, aes(x = .data[[predictor]], y = Yield)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "salmon") +  # Add a linear regression line
    labs(x = predictor, y = "Yield", title = paste("Relationship between", predictor, "and Yield")) +
    theme_minimal()
    
  print(p)
}
[Ten scatter plots, one per top-10 predictor, each plotting the predictor against Yield with a fitted linear trend line.]

Positive relationships:

  • ManufacturingProcess32 - Stronger

  • ManufacturingProcess09 - Stronger

  • BiologicalMaterial06 - Weaker

  • BiologicalMaterial03 - Weaker

  • BiologicalMaterial12 - Weaker

  • BiologicalMaterial02 - Weaker

Negative relationships:

  • ManufacturingProcess13 - Stronger

  • ManufacturingProcess17 - Stronger

  • ManufacturingProcess36 - Stronger

  • ManufacturingProcess31 - Weaker; the data do not show a solid linear trend

There are four groups: strong positive, strong negative, weak positive, and weak negative. Within these groups, the stronger relationships will be the most helpful for improving yield in future runs of the manufacturing process.

ManufacturingProcess32 and ManufacturingProcess09 have strong positive relationships with Yield, meaning higher values for those processes correlate with higher yield. This is useful information: increasing their values within an efficient range could result in higher yields.

ManufacturingProcess13, ManufacturingProcess17, and ManufacturingProcess36 have strong negative relationships with Yield, meaning higher values for those processes correlate with lower yield. This is also useful information: regulating these processes to decrease their values could result in higher yields.
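
A minimal sketch to quantify those groupings: simple correlations between each top predictor and (standardized) Yield, whose signs and rough magnitudes should line up with the strong/weak labels above.

# Correlate each top-10 predictor with Yield on the imputed data
cor_vals <- sapply(top10_predictors, function(p)
  cor(filtered_chem[[p]], filtered_chem$Yield))
round(sort(cor_vals, decreasing = TRUE), 3)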