This document presents solutions to a series of exercises from Chapter 7 in the book Applied Predictive Modeling by Max Kuhn and Kjell Johnson. These exercises focus on non-linear regression techniques, including Neural Networks, Multivariate Adaptive Regression Splines (MARS), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). The aim is to deepen understanding of how these models handle complex, non-linear relationships in predictive modeling tasks.
We will address the following problems step by step:
Problem 7.2: This exercise involves simulating a nonlinear dataset and applying predictive models such as k-Nearest Neighbors (KNN) and Multivariate Adaptive Regression Splines (MARS). The goal is to evaluate model performance, tune hyperparameters, and analyze the ability of MARS to select informative predictors (those named X1–X5) from the dataset.
Problem 7.5: This exercise explores the application of nonlinear regression models, specifically Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and Multivariate Adaptive Regression Splines (MARS). The goal is to compare their performance in predicting a continuous response (Yield) while identifying the most important predictors. The exercise emphasizes the critical role of hyperparameter tuning, data preprocessing, and visualization to interpret model insights and optimize predictive performance.
Each solution is accompanied by R code, results, and discussions:
Use the mlbench.friedman1() function from the mlbench package to create the training (200 samples) and test (5,000 samples) datasets. Include non-informative predictors to evaluate model robustness.
# Load necessary libraries
library(mlbench)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(earth)
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
# Simulate Friedman1 dataset
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
testData <- mlbench.friedman1(5000, sd = 1)
# Convert matrices to data frames
trainingData$x <- data.frame(trainingData$x)
testData$x <- data.frame(testData$x)
# Check dataset structure
str(trainingData$x)
## 'data.frame': 200 obs. of 10 variables:
## $ X1 : num 0.534 0.584 0.59 0.691 0.667 ...
## $ X2 : num 0.648 0.438 0.588 0.226 0.819 ...
## $ X3 : num 0.8508 0.6727 0.4097 0.0334 0.7168 ...
## $ X4 : num 0.1816 0.6692 0.3381 0.0669 0.8032 ...
## $ X5 : num 0.929 0.1638 0.8941 0.6374 0.0831 ...
## $ X6 : num 0.3618 0.4531 0.0268 0.525 0.2234 ...
## $ X7 : num 0.827 0.649 0.179 0.513 0.664 ...
## $ X8 : num 0.421 0.845 0.35 0.797 0.904 ...
## $ X9 : num 0.5911 0.9282 0.0176 0.6899 0.397 ...
## $ X10: num 0.589 0.758 0.444 0.445 0.55 ...
In this step, we simulated data using the mlbench.friedman1 function from the mlbench package. This function generates a dataset based on the Friedman1 equation:
\[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \]
where \(x_1, x_2, \ldots, x_{10}\) are predictors. Only \(x_1\) through \(x_5\) are informative, while \(x_6\) through \(x_{10}\) are non-informative (random noise).
Key Observations
Purpose: The primary goal of this simulation is to provide a controlled environment to test regression models, evaluating their ability to:
- Identify the relationships between the predictors and the response.
- Handle the presence of irrelevant predictors (\(X_6\)–\(X_{10}\)).
This setup is critical for exploring the performance of nonlinear regression methods like KNN, MARS, and others in subsequent steps.
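As an informal check of this structure, we can inspect the marginal correlation between each simulated predictor and the response (a quick sketch, not part of the exercise; note that a linear correlation understates X3's quadratic effect):
# Marginal correlation of each predictor with y: X1-X5 should show the
# strongest associations, while the noise predictors X6-X10 hover near zero.
round(sort(abs(cor(trainingData$x, trainingData$y)[, 1]), decreasing = TRUE), 2)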
Fit a KNN model using caret with bootstrap resampling and tune the number of neighbors (k).
# KNN model with preprocessing
set.seed(123)
knnModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "boot", number = 25)
)
# Print KNN model results
print(knnModel)
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.548433 0.4919564 2.888283
## 7 3.425531 0.5255725 2.778090
## 9 3.346026 0.5523023 2.704791
## 11 3.252313 0.5875603 2.620492
## 13 3.232552 0.6000482 2.601113
## 15 3.205067 0.6203296 2.586704
## 17 3.172791 0.6408339 2.566738
## 19 3.183306 0.6494300 2.587220
## 21 3.190873 0.6556293 2.596793
## 23 3.202234 0.6597746 2.604279
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
# Predict on the test set
knnPred <- predict(knnModel, newdata = testData$x)
# Evaluate performance
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
In this step, we trained and tuned a k-Nearest Neighbors (KNN) model on the Friedman1 dataset using the caret package.
In summary: the KNN model effectively captured the nonlinear relationships in the Friedman1 data. The optimal number of neighbors (\(k = 17\)) provides a balance between model complexity and generalization, as evidenced by the low RMSE and high R-squared on the test data.
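To confirm that \(k = 17\) sits near the minimum of the resampling profile, caret's plot method for train objects gives a one-line diagnostic (an optional addition, not part of the original workflow):
# Visualize bootstrap RMSE across the tuned values of k; the curve should
# bottom out near k = 17.
plot(knnModel, metric = "RMSE")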
Fit a MARS model using the earth method in caret. Tune hyperparameters, such as the number of retained terms and the degree of interactions.
# MARS model with preprocessing
set.seed(123)
marsModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "earth",
tuneLength = 10,
trControl = trainControl(method = "boot", number = 25)
)
# Print MARS model results
print(marsModel)
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 4.379381 0.2301740 3.575902
## 3 3.649438 0.4583683 2.944879
## 4 2.769352 0.6876944 2.223704
## 6 2.366383 0.7734368 1.888582
## 7 1.988717 0.8380231 1.581362
## 9 1.827116 0.8637619 1.443208
## 10 1.788268 0.8690065 1.410531
## 12 1.815936 0.8656814 1.422587
## 13 1.824463 0.8644827 1.433229
## 15 1.856755 0.8590033 1.460392
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 10 and degree = 1.
# Predict on the test set
marsPred <- predict(marsModel, newdata = testData$x)
# Evaluate performance
postResample(pred = marsPred, obs = testData$y)
## RMSE Rsquared MAE
## 1.776575 0.872700 1.358367
In this step, we trained and tuned a Multivariate Adaptive Regression Splines (MARS) model on the Friedman1 dataset using the earth method from the caret package. The number of retained terms (nprune) was tuned over ten values using RMSE as the criterion, while the interaction degree (degree) was held constant at 1.
Conclusion: The MARS model demonstrated superior performance compared to the KNN model, effectively capturing the complex nonlinear relationships in the Friedman1 data. This is likely due to its ability to identify the important variables, making it a robust choice for modeling such data. The side-by-side comparison below makes the gap explicit.
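For a direct comparison, the test-set metrics computed earlier can be stacked into one table (a small convenience sketch using the knnPred and marsPred objects created above):
# Side-by-side test-set performance of the two models fit above.
rbind(KNN  = postResample(pred = knnPred, obs = testData$y),
      MARS = postResample(pred = marsPred, obs = testData$y))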
In this section, we summarize the results of Problem 7.2 and directly address the question: Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
Model Performance Comparison
On the 5,000-sample test set, MARS achieved RMSE = 1.78 and R-squared = 0.87, compared with RMSE = 3.20 and R-squared = 0.68 for KNN.
Conclusion: The MARS model gives the best performance based on its superior test metrics.
Informative Predictors (X1–X5)
The dataset was generated using the equation: \[ y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2) \] where \(X1\)–\(X5\) are the informative predictors and \(X6\)–\(X10\) are non-informative noise predictors.
To evaluate whether MARS selects the informative predictors, we examined the terms selected in the final model:
summary(marsModel$finalModel)
## Call: earth(x=data.frame[200,10], y=c(18.46,16.1,17...), keepxy=TRUE, degree=1,
## nprune=10)
##
## coefficients
## (Intercept) 20.395804
## h(0.621722-X1) -10.925741
## h(0.601063-X2) -10.668385
## h(X3-0.281766) 3.966649
## h(0.447442-X3) 12.392139
## h(X3-0.636458) 7.640411
## h(0.734892-X4) -9.900621
## h(X4-0.734892) 10.274706
## h(0.850094-X5) -5.343409
## h(X6-0.361791) -1.769825
##
## Selected 10 of 18 terms, and 6 of 10 predictors (nprune=10)
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 9 (additive model)
## GCV 2.731203 RSS 447.3848 GRSq 0.889112 RSq 0.9082649
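The retained predictors can also be extracted programmatically with earth's evimp() importance function (an optional sketch; it returns only the variables used by the final model):
# Variables retained by the final MARS model; X1-X5 should dominate, with at
# most a marginal X6 term.
rownames(evimp(marsModel$finalModel))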
Final Answer
Which model gives the best performance? The MARS model: its test RMSE of 1.78 and R-squared of 0.87 clearly beat the KNN results (RMSE 3.20, R-squared 0.68).
Does MARS select the informative predictors (X1–X5)? Yes. The final model selected X1–X5 as the main informative predictors while excluding the noise predictors X7–X10. The partial inclusion of X6, a single hinge term with a small coefficient, does not detract from its effectiveness.
Problem 7.5 uses the data for a chemical manufacturing process, preprocessed as in Exercise 6.3:
# Load the required library and dataset
library(AppliedPredictiveModeling)
library(caret)
# Load dataset
data(ChemicalManufacturingProcess)
# Separate predictors and response
predictors <- ChemicalManufacturingProcess[, -1] # Exclude the first column (Yield)
response <- ChemicalManufacturingProcess$Yield
# Handle missing values using median imputation
preprocess <- preProcess(predictors, method = "medianImpute")
imputed_predictors <- predict(preprocess, predictors)
# Split the data into training (80%) and testing (20%) sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(response, p = 0.8, list = FALSE)
# Create training and testing sets
train_predictors <- imputed_predictors[train_index, ]
test_predictors <- imputed_predictors[-train_index, ]
train_response <- response[train_index]
test_response <- response[-train_index]
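# (Added sketch, not part of the original exercise) Defensive check: the
# medianImpute step should leave no missing cells in the predictors.
stopifnot(sum(is.na(imputed_predictors)) == 0)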
# Train a MARS model using the 'earth' method
set.seed(123)
mars_model <- train(
x = train_predictors,
y = train_response,
method = "earth",
tuneLength = 10, # Tune over 10 values of hyperparameters
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print MARS model details
mars_model
## Multivariate Adaptive Regression Spline
##
## 144 samples
## 57 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared MAE
## 2 1.378691 0.4988104 1.1236683
## 3 1.218983 0.6300794 0.9904402
## 5 1.292456 0.5640833 1.0531348
## 7 1.352366 0.5463897 1.1064317
## 8 1.328867 0.5530096 1.0965782
## 10 1.403699 0.5200030 1.1205506
## 12 1.422694 0.4938241 1.1282104
## 13 1.423642 0.4954649 1.1182360
## 15 1.432566 0.5021617 1.1246753
## 17 1.432889 0.5050105 1.1235802
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
# Predict and evaluate on the test set
mars_pred <- predict(mars_model, newdata = test_predictors)
mars_perf <- postResample(pred = mars_pred, obs = test_response)
mars_perf
## RMSE Rsquared MAE
## 1.3765957 0.4680622 1.1161381
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
# Train an SVM model using radial kernel
set.seed(123)
svm_model <- train(
x = train_predictors,
y = train_response,
method = "svmRadial",
tuneLength = 10, # Tune over 10 values of cost and sigma
preProcess = c("center", "scale"), # Preprocessing
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print SVM model details
svm_model
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.357053 0.5717606 1.1060197
## 0.50 1.250562 0.6057450 1.0163217
## 1.00 1.161474 0.6434975 0.9343628
## 2.00 1.111380 0.6638872 0.8952385
## 4.00 1.099131 0.6618941 0.8755585
## 8.00 1.080413 0.6781477 0.8590074
## 16.00 1.078416 0.6806858 0.8582669
## 32.00 1.078416 0.6806858 0.8582669
## 64.00 1.078416 0.6806858 0.8582669
## 128.00 1.078416 0.6806858 0.8582669
##
## Tuning parameter 'sigma' was held constant at a value of 0.01452627
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01452627 and C = 16.
# Predict and evaluate on the test set
svm_pred <- predict(svm_model, newdata = test_predictors)
svm_perf <- postResample(pred = svm_pred, obs = test_response)
svm_perf
## RMSE Rsquared MAE
## 1.2222178 0.5568576 1.0260355
# Train a KNN model
set.seed(123)
knn_model <- train(
x = train_predictors,
y = train_response,
method = "knn",
tuneLength = 10, # Tune over 10 values of 'k'
preProcess = c("center", "scale"), # Preprocessing
trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
)
# Print KNN model details
knn_model
## k-Nearest Neighbors
##
## 144 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 128, 129, 129, 130, 128, 131, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.234126 0.5855017 0.9638509
## 7 1.292584 0.5489661 1.0408474
## 9 1.294423 0.5543285 1.0511925
## 11 1.302680 0.5447302 1.0603173
## 13 1.321252 0.5210424 1.0727730
## 15 1.346144 0.5052424 1.1017134
## 17 1.350765 0.5084828 1.0910419
## 19 1.366269 0.4950894 1.1099770
## 21 1.374749 0.4950063 1.1181499
## 23 1.392496 0.4843170 1.1313867
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Predict and evaluate on the test set
knn_pred <- predict(knn_model, newdata = test_predictors)
knn_perf <- postResample(pred = knn_pred, obs = test_response)
knn_perf
## RMSE Rsquared MAE
## 1.4063326 0.4224466 1.1606875
# Create a summary table of performance metrics
performance_results <- data.frame(
Model = c("MARS", "SVM", "KNN"),
RMSE = c(mars_perf["RMSE"], svm_perf["RMSE"], knn_perf["RMSE"]),
Rsquared = c(mars_perf["Rsquared"], svm_perf["Rsquared"], knn_perf["Rsquared"])
)
# Print the comparison table
performance_results
## Model RMSE Rsquared
## 1 MARS 1.376596 0.4680622
## 2 SVM 1.222218 0.5568576
## 3 KNN 1.406333 0.4224466
Based on the results, the performance of the three nonlinear regression models—MARS, SVM, and KNN—was evaluated using RMSE and R-squared metrics. The findings are summarized as follows:
Conclusion: Among the three models, the SVM model provided the optimal resampling and test set performance. It achieved the lowest RMSE, indicating better prediction accuracy, and the highest R-squared, suggesting a stronger explanatory power for the response variable. Thus, SVM is the recommended nonlinear regression model for this dataset.
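Because all three models were tuned with identical seeds and the same 10-fold cross-validation setup, caret's resamples() can summarize their resampling distributions side by side (an optional sketch, not part of the original solution):
# Collect and summarize the cross-validation results from the three fits.
model_comparison <- resamples(list(MARS = mars_model, SVM = svm_model, KNN = knn_model))
summary(model_comparison)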
# Variable importance for the SVM model
varImp(svm_model)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## BiologicalMaterial06 94.06
## ManufacturingProcess36 81.54
## BiologicalMaterial03 81.27
## ManufacturingProcess13 80.63
## ManufacturingProcess31 78.52
## BiologicalMaterial02 76.04
## ManufacturingProcess17 75.92
## ManufacturingProcess09 73.04
## BiologicalMaterial12 69.48
## ManufacturingProcess06 66.24
## BiologicalMaterial11 59.72
## ManufacturingProcess33 57.06
## ManufacturingProcess29 54.40
## BiologicalMaterial04 53.93
## BiologicalMaterial01 45.62
## BiologicalMaterial08 44.93
## ManufacturingProcess30 42.47
## BiologicalMaterial09 40.88
## ManufacturingProcess11 38.38
The importance of predictors in the optimal nonlinear regression model, identified using the Support Vector Machines (SVM) model with a Radial Basis Function kernel, was evaluated. The results indicate that both biological and process variables contribute significantly to the model. Among the top ten most important predictors, a mix of variables from both categories is observed.
Biological vs. Process Variables
From the ranked list of the top ten predictors:
1. The most important predictor is a process variable (ManufacturingProcess32) with a relative importance of 100%.
2. Biological variables, such as BiologicalMaterial06 and BiologicalMaterial03, rank second and fourth, respectively, with relative importance values of 94.06% and 81.27%.
3. Process variables, such as ManufacturingProcess36 and ManufacturingProcess13, also rank high, with importance values of 81.54% and 80.63%.
Overall, the results suggest a balanced dominance of both biological and process variables among the most significant predictors. While ManufacturingProcess32 holds the highest relative importance, biological variables also rank prominently, highlighting their critical role in explaining the response variable in this manufacturing process.
In the optimal nonlinear regression model (SVM), both biological and process variables are critical. Process variables slightly dominate the top positions, but biological variables are also key contributors, indicating that successful modeling of the response depends on capturing interactions across both types of predictors.
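The ranked importances are easier to digest visually; caret provides a plot method for varImp() objects (an optional one-liner, top 10 shown):
# Plot the ten highest loess r-squared importance scores from the SVM fit.
plot(varImp(svm_model), top = 10)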
suppressMessages({
  suppressWarnings({
    # Visualize relationships between Yield and the top 10 most important
    # predictors from the SVM model
    library(ggplot2)
    library(gridExtra)
    # Extract top 10 SVM-selected predictors
    selected_predictors <- c("ManufacturingProcess32", "BiologicalMaterial06",
                             "ManufacturingProcess36", "BiologicalMaterial03",
                             "ManufacturingProcess13", "ManufacturingProcess31",
                             "BiologicalMaterial02", "ManufacturingProcess17",
                             "ManufacturingProcess09", "BiologicalMaterial12")
    # Create a list to store plots
    plot_list <- list()
    # Build one scatter plot per predictor; the tidy-eval .data pronoun
    # replaces the deprecated aes_string()
    for (predictor in selected_predictors) {
      plot_list[[predictor]] <- ggplot(data = ChemicalManufacturingProcess,
                                       aes(x = .data[[predictor]], y = Yield)) +
        geom_point(color = "darkorange", alpha = 0.6) +
        geom_smooth(method = "lm", color = "blue", se = FALSE) +
        ggtitle(paste("Relationship between", predictor, "and Yield")) +
        theme_minimal() +
        labs(x = predictor, y = "Yield") +
        theme(plot.title = element_text(size = 5, face = "bold"),
              axis.title = element_text(size = 7),
              axis.text = element_text(size = 5)) # Adjust the sizes as needed
    }
    # Combine all plots into a 3-column grid
    grid.arrange(grobs = plot_list, ncol = 3)
  })
})
Analysis of Relationships Between Predictors and Yield:
The visualizations of the top 10 predictors from the SVM model (optimal nonlinear regression model) against the response variable (Yield) reveal distinct patterns and relationships.
- Process variables: ManufacturingProcess32, ManufacturingProcess36, and ManufacturingProcess13 exhibit linear or slightly nonlinear trends with Yield. ManufacturingProcess32 shows a positive trend, suggesting that higher values of this process parameter might correspond to increased yield. ManufacturingProcess13 and ManufacturingProcess31, on the other hand, display negative trends, indicating a possible inverse relationship between these process parameters and yield.
- Biological variables: BiologicalMaterial06, BiologicalMaterial03, and BiologicalMaterial02 show varying degrees of correlation with Yield. BiologicalMaterial06 demonstrates a positive trend, suggesting that its increased presence positively impacts the yield. The trend for BiologicalMaterial03 is positive but less pronounced, indicating a moderate association with yield.
- While some process variables, such as ManufacturingProcess32, have positive effects, others, like ManufacturingProcess13, exhibit negative trends. This suggests that optimizing the process parameters might involve trade-offs or thresholds beyond which the process becomes less efficient.
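These visual trends can be roughly quantified with simple correlations between each top predictor and Yield (a sketch; the raw data contain missing values, so complete observations are used):
# Pearson correlation of each top-10 predictor with Yield, using complete
# observations because the raw predictors contain missing values.
sapply(selected_predictors, function(p)
  cor(ChemicalManufacturingProcess[[p]], ChemicalManufacturingProcess$Yield,
      use = "complete.obs"))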