The following RMD contains CUNY SPS DATA 624 Spring 2025 answers to exercises in chapter 7 of Kuhn and Johnson’s Applied Predictive Modeling textbook http://appliedpredictivemodeling.com/. This chapter focuses on nonlinear regression models. The chapter 7 Exercises answered here include 7.2 and 7.5.
library(tidyverse)
library(Seurat)
library(caret)
library(earth)
library(kernlab)
library(nnet)
library(knitr)
library(kableExtra)
Prompt: Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data: [see textbook for formula] where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
# Provided code
library(mlbench)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
# Tune several models on these data. For example:
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
# knnModel
# Predict on the model
knnPred <- predict(knnModel, newdata = testData$x)
#Get the test set performance values
knnPerformanceFriedman <- postResample(pred = knnPred, obs = testData$y)
knnPerformanceFriedman
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
# Train first model MARS
marsFitFriedman <- earth(trainingData$x, trainingData$y)
marsFitFriedman
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
# See summary to get prune values
summary(marsFitFriedman)
## Call: earth(x=trainingData$x, y=trainingData$y)
##
## coefficients
## (Intercept) 18.451984
## h(0.621722-X1) -11.074396
## h(0.601063-X2) -10.744225
## h(X3-0.281766) 20.607853
## h(0.447442-X3) 17.880232
## h(X3-0.447442) -23.282007
## h(X3-0.636458) 15.150350
## h(0.734892-X4) -10.027487
## h(X4-0.734892) 9.092045
## h(0.850094-X5) -4.723407
## h(X5-0.850094) 10.832932
## h(X6-0.361791) -1.956821
##
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
# Tune first model
# Define the candidate models to test
marsGridFriedman <- expand.grid(.degree = 1:2, .nprune = 2:12)
# Fix the seed so that the results can be reproduced
set.seed(64)
# Train the MARS model
marsModelFriedman <- train(x = trainingData$x,
y = trainingData$y,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGridFriedman,
trControl = trainControl(method = "cv"))
# marsModelFriedman
# > The final values used for the model were nprune = 12 and degree = 2
# Predict on the model
marsPredFriedman <- predict(marsModelFriedman, newdata = testData$x)
#Get the test set performance values
marsPerformanceFriedman <- postResample(pred = marsPredFriedman, obs = testData$y)
marsPerformanceFriedman
## RMSE Rsquared MAE
## 1.2803060 0.9335241 1.0168673
# Tune second model
svmModelFriedman <- train(x = trainingData$x,
y = trainingData$y,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
# svmModelFriedman
# Predict on the model
svmPredFriedman <- predict(svmModelFriedman, newdata = testData$x)
#Get the test set performance values
svmPerformanceFriedman <- postResample(pred = svmPredFriedman, obs = testData$y)
svmPerformanceFriedman
## RMSE Rsquared MAE
## 2.0748819 0.8254684 1.5760551
# Tune third model
nnetGridFriedman <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
# this uses bagging
.bag = FALSE)
set.seed(64)
# Train the neural network model
nnetModelFriedman <- train(x = trainingData$x,
y = trainingData$y,
method = "avNNet",
tuneGrid = nnetGridFriedman,
trControl = trainControl(method = "cv", number = 10),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
## Warning: executing %dopar% sequentially: no parallel backend registered
# nnetModelFriedman
# Predict on the model
nnetPredFriedman <- predict(nnetModelFriedman, newdata = testData$x)
#Get the test set performance values
nnetPerformanceFriedman <- postResample(pred = nnetPredFriedman, obs = testData$y)
nnetPerformanceFriedman
## RMSE Rsquared MAE
## 1.9595108 0.8492313 1.4737210
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
The model that gives the best performance is the MARS model. It has the lowest RMSE and MAE, and highest R squared, which makes it the best performing model.
model | RMSE | Rsquared | MAE |
---|---|---|---|
KNN | 3.2041 | 0.6820 | 2.5683 |
MARS | 1.2803 | 0.9335 | 1.0169 |
SVM | 2.0749 | 0.8255 | 1.5761 |
NNet | 1.9595 | 0.8492 | 1.4737 |
MARS does select the informative predictors, which we know are X1, X2, X3, X4 and X5 - ordered below as X1, X4, X2, X5, and X3.
varImp(marsModelFriedman)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.33
## X2 48.88
## X5 15.63
## X3 0.00
Prompt: Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
# Set seed for reproducability
set.seed(64)
# Provided code
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
# Predict missing values using KNN imputation function
chem_impute_vals <- preProcess(
ChemicalManufacturingProcess,
method = c("knnImpute")
)
# Fill the predicted values
chem_imputed <- predict(
chem_impute_vals,
ChemicalManufacturingProcess
)
# Check for any remaining missing values
# sum(is.na(chem_imputed))
# Like in 6.2 b, filter using nearZeroVar
near_zero_chem <- nearZeroVar(chem_imputed)
filtered_chem <- chem_imputed[, -near_zero_chem]
# Create partition of 80-20
index_chem <- createDataPartition(filtered_chem$Yield, p = .8, list = FALSE)
# Split X into training and testing (exclude 'Yield' column)
X_train_chem <- filtered_chem[index_chem, !(names(filtered_chem) %in% "Yield")]
X_test_chem <- filtered_chem[-index_chem, !(names(filtered_chem) %in% "Yield")]
# Split y into training and testing (only the 'Yield' column)
y_train_chem <- filtered_chem[index_chem, "Yield", drop = TRUE]
y_test_chem <- filtered_chem[-index_chem, "Yield", drop = TRUE]
# Train the pls model from 6.3
plsModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "pls",
preProcess = c("center", "scale"),
tuneLength = 20,
trControl = trainControl(method = "cv", number = 10))
# Train first model MARS
marsFitChem <- earth(X_train_chem, y_train_chem)
marsFitChem
## Selected 18 of 22 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.3000591 RSS 24.75696 GRSq 0.7099421 RSq 0.8314745
# See summary to get prune values
summary(marsFitChem)
## Call: earth(x=X_train_chem, y=y_train_chem)
##
## coefficients
## (Intercept) -1.2531287
## h(1.08846-BiologicalMaterial03) -0.2840134
## h(BiologicalMaterial03-1.08846) -0.8726237
## h(-1.04109-BiologicalMaterial08) -0.8041139
## h(BiologicalMaterial08- -1.04109) -0.2569086
## h(-0.531729-BiologicalMaterial11) 0.7057958
## h(BiologicalMaterial11- -0.531729) 0.3607210
## h(-1.23517-ManufacturingProcess09) -0.6034754
## h(ManufacturingProcess09- -1.23517) 0.2631493
## h(-1.38676-ManufacturingProcess13) 1.7665158
## h(0.763612-ManufacturingProcess28) 0.2537783
## h(ManufacturingProcess28-0.763612) 2.9373045
## h(ManufacturingProcess32- -1.198) 1.0646118
## h(ManufacturingProcess32-1.39591) -1.1080155
## h(-1.02435-ManufacturingProcess33) 0.8034576
## h(ManufacturingProcess33- -1.02435) -0.3426442
## h(0.129718-ManufacturingProcess35) 0.1851258
## h(0.0988802-ManufacturingProcess39) -0.2572506
##
## Selected 18 of 22 terms, and 10 of 56 predictors
## Termination condition: RSq changed by less than 0.001 at 22 terms
## Importance: ManufacturingProcess32, ManufacturingProcess09, ...
## Number of terms at each degree of interaction: 1 17 (additive model)
## GCV 0.3000591 RSS 24.75696 GRSq 0.7099421 RSq 0.8314745
# Tune first model
# Define the candidate models to test
marsGridChem <- expand.grid(.degree = 1:2, .nprune = 2:23)
# Fix the seed so that the results can be reproduced
set.seed(64)
# Train the MARS model
marsModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "earth",
# Explicitly declare the candidate models to test
tuneGrid = marsGridChem,
trControl = trainControl(method = "cv"))
# marsModelChem
# > The final values used for the model were nprune = 5 and degree = 1.
# Predict on the model
marsPredChem <- predict(marsModelChem, newdata = X_test_chem)
#Get the test set performance values
marsPerformanceChem <- postResample(pred = marsPredChem, obs = y_test_chem)
marsPerformanceChem
## RMSE Rsquared MAE
## 0.6397000 0.5439205 0.5311242
# Tune second model
svmModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,
trControl = trainControl(method = "cv"))
# svmModelChem
# > The final values used for the model were sigma = 0.01424208 and C = 16.
# Predict on the model
svmPredChem <- predict(svmModelChem, newdata = X_test_chem)
#Get the test set performance values
svmPerformanceChem <- postResample(pred = svmPredChem, obs = y_test_chem)
svmPerformanceChem
## RMSE Rsquared MAE
## 0.5265988 0.6873471 0.4488771
# Tune third model
nnetGridChem <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
# this uses bagging
.bag = FALSE)
set.seed(64)
# Train the neural network model
nnetModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "avNNet",
tuneGrid = nnetGridFriedman,
trControl = trainControl(method = "cv", number = 10),
## Automatically standardize data prior to modeling and prediction
preProc = c("center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
# nnetModelChem
# > The final values used for the model were size = 2, decay = 0.1 and bag = FALSE.
# Predict on the model
nnetPredChem <- predict(nnetModelChem, newdata = X_test_chem)
#Get the test set performance values
nnetPerformanceChem <- postResample(pred = nnetPredChem, obs = y_test_chem)
nnetPerformanceChem
## RMSE Rsquared MAE
## 0.5276199 0.6851127 0.4391207
# Tune several models on these data. For example:
knnModelChem <- train(x = X_train_chem,
y = y_train_chem,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
# knnModelChem
# > The final value used for the model was k = 11.
# Predict on the model
knnPredChem <- predict(knnModelChem, newdata = X_test_chem)
#Get the test set performance values
knnPerformanceChem <- postResample(pred = knnPredChem, obs = y_test_chem)
knnPerformanceChem
## RMSE Rsquared MAE
## 0.6375514 0.5544677 0.5344060
The model that gives the best performance is the SVM model. It has the lowest RMSE, second lowest MAE, and highest R squared, which makes it the best performing model overall. The close second would be the Neural Network model, which has scores close to the SVM model.
# Create a unified data frame of performance results
chem_model_performance <- data.frame(
model = c("KNN", "MARS", "SVM", "NNet"),
RMSE = c(knnPerformanceChem["RMSE"],
marsPerformanceChem["RMSE"],
svmPerformanceChem["RMSE"],
nnetPerformanceChem["RMSE"]),
Rsquared = c(knnPerformanceChem["Rsquared"],
marsPerformanceChem["Rsquared"],
svmPerformanceChem["Rsquared"],
nnetPerformanceChem["Rsquared"]),
MAE = c(knnPerformanceChem["MAE"],
marsPerformanceChem["MAE"],
svmPerformanceChem["MAE"],
nnetPerformanceChem["MAE"])
)
# View the result in a kable visual
kable(chem_model_performance, caption = "Model Performance Comparison", digits = 4) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
model | RMSE | Rsquared | MAE |
---|---|---|---|
KNN | 0.6376 | 0.5545 | 0.5344 |
MARS | 0.6397 | 0.5439 | 0.5311 |
SVM | 0.5266 | 0.6873 | 0.4489 |
NNet | 0.5276 | 0.6851 | 0.4391 |
The optimal nonlinear regression model is the SVM model.
SVM Top Predictors:
Top 3 are process variables
6/10 total are process variables
Process variables are more important, but they are not dominating the list.
Linear Model (PLS) Top Predictors:
Top 4 are process variables
7/10 are process variables
Process variables are more important, and they are closer to dominating the list than the SVM model.
print("SVM model important predictors:")
## [1] "SVM model important predictors:"
varImp(svmModelChem)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 93.63
## ManufacturingProcess09 82.69
## BiologicalMaterial06 81.95
## ManufacturingProcess17 77.98
## BiologicalMaterial03 76.79
## ManufacturingProcess36 74.03
## BiologicalMaterial12 72.56
## ManufacturingProcess31 64.06
## BiologicalMaterial02 63.29
## ManufacturingProcess06 55.97
## ManufacturingProcess11 52.96
## ManufacturingProcess33 49.85
## ManufacturingProcess12 42.18
## BiologicalMaterial11 41.00
## ManufacturingProcess02 39.55
## ManufacturingProcess18 39.55
## ManufacturingProcess29 39.22
## BiologicalMaterial04 38.96
## BiologicalMaterial01 38.29
print("Linear model important predictors:")
## [1] "Linear model important predictors:"
varImp(plsModelChem)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess09 89.64
## ManufacturingProcess36 85.99
## ManufacturingProcess13 83.96
## BiologicalMaterial02 75.58
## BiologicalMaterial06 75.41
## BiologicalMaterial03 74.40
## ManufacturingProcess17 71.80
## ManufacturingProcess33 70.50
## ManufacturingProcess12 64.83
## ManufacturingProcess06 63.69
## ManufacturingProcess11 61.52
## BiologicalMaterial01 58.13
## BiologicalMaterial12 57.39
## BiologicalMaterial08 57.36
## BiologicalMaterial04 54.07
## BiologicalMaterial11 51.69
## ManufacturingProcess20 39.99
## ManufacturingProcess04 37.17
## ManufacturingProcess30 37.00
# Get the most important predictors with varImp
svm_importance <- varImp(svmModelChem, scale = FALSE)$importance |>
arrange(-Overall)
# Get the top 10 important predictors
top10 <- head(svm_importance, 10)
# Loop through top 10 predictors and create scatter plots
top10_predictors <- rownames(top10)
for (predictor in top10_predictors) {
p <- ggplot(filtered_chem, aes(x = .data[[predictor]], y = Yield)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "salmon") + # Add a linear regression line
labs(x = predictor, y = "Yield", title = paste("Relationship between", predictor, "and Yield")) +
theme_minimal()
print(p)
}
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Positive relationships:
ManufacturingProcess32 - Stronger
ManufacturingProcess09 - Stronger
BiologicalMaterial06 - Weaker
BiologicalMaterial03 - Weaker
BiologicalMaterial12 - Weaker
BiologicalMaterial02 - Weaker
Negative relationships:
ManufacturingProcess13 - Stronger
ManufacturingProcess17 - Stronger
ManufacturingProcess36 - Stronger
ManufacturingProcess31** - Weaker, data does not have a solid linear trend
There are 4 groups: strong postive, strong negative, weak postive, weak negative. Within these groups, the stronger relationships will be most helpful in improving yield in future runs of the manufacturing process.
ManufacturingProcess32 and ManufacturingProcess09 have strong postive relationships with Yeild. This means increased numerical values for those processes correlate with higher yeild. This is good information as increasing their values within an efficient range could result in higher yields.
ManufacturingProcess13, ManufacturingProcess17 and ManufacturingProcess36 have strong negative relationships with Yeild. This means increased numerical values for those processes correlate with lower yeild. This is good information as more regulation to these processes to decrease their values could result in higher yields.