##7.2
7.2. Friedman (1991) introduced several benchmark data sets create by simulation. One of these simulations used the following nonlinear equation to create data: y = 10 sin(πx1x2) + 20(x3 − 0.5)2 + 10x4 + 5x5 + N(0, σ2) where the x values are random variables uniformly distributed between [0, 1] (there are also 5 other non-informative variables also created in the simulation). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mlbench)
## Warning: package 'mlbench' was built under R version 4.3.3
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
library(caret)
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
set.seed(200)
knnPred <- predict(knnModel, newdata = testData$x)
# The function 'postResample' can be used to get the test set
# perforamnce values
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
Which models appear to give the best performance? Does MARS select the informative predictors (those named X1–X5)?
The Friedman dataset has informative predictors (X1X1​ to X5X5​) and non-informative predictors (X6X6​ to X10X10​). MARS, when trained on this data, performs feature selection by identifying these relationships and assigning higher importance to X1X1​ to X5X to see below
Fit a MARS
marsModel <- train(
x = trainingData$x,
y = trainingData$y,
method = "earth", # MARS model
tuneLength = 10
)
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.3.3
## Loading required package: Formula
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.3.3
## Loading required package: plotrix
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 82.92
## X2 64.47
## X5 40.67
## X3 28.65
## X6 0.00
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models. (a) Which nonlinear regression model gives the optimal resampling and test set performance?
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.3.3
data(ChemicalManufacturingProcess)
chemical <- ChemicalManufacturingProcess
# Imputation
preProc <- preProcess(chemical, method = "knnImpute")
chemical_imputed <- predict(preProc, chemical)
# Data splitting
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(chemical_imputed), replace = TRUE, prob = c(0.8, 0.2))
train <- chemical_imputed[sample, ]
test <- chemical_imputed[!sample, ]
# Pre-processing
preProc <- preProcess(train, method = c("center", "scale"))
train_preprocessed <- predict(preProc, train)
test_preprocessed <- predict(preProc, test)
# Train nonlinear models
set.seed(123)
models <- list()
# Random Forest
models$rf <- train(
Yield ~ .,
data = train_preprocessed,
method = "rf",
trControl = trainControl(method = "cv", number = 10),
metric = "RMSE"
)
# Gradient Boosting Machine
models$gbm <- train(
Yield ~ .,
data = train_preprocessed,
method = "gbm",
trControl = trainControl(method = "cv", number = 10),
verbose = FALSE,
metric = "RMSE"
)
# Support Vector Machines
models$svm <- train(
Yield ~ .,
data = train_preprocessed,
method = "svmRadial",
trControl = trainControl(method = "cv", number = 10),
metric = "RMSE"
)
# Compare model performance
resamples <- resamples(models)
summary(resamples)
##
## Call:
## summary.resamples(object = resamples)
##
## Models: rf, gbm, svm
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.3205756 0.4310047 0.4635988 0.4648531 0.5288804 0.5716890 0
## gbm 0.3363901 0.4297352 0.4426877 0.4592677 0.5169445 0.6084667 0
## svm 0.3436503 0.4701333 0.4930998 0.5155739 0.6222974 0.6764083 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.4277200 0.5715860 0.6126162 0.6041201 0.6436861 0.8156937 0
## gbm 0.4644638 0.5244978 0.5876787 0.5869634 0.6286567 0.7927376 0
## svm 0.4068963 0.5411064 0.5938343 0.6205820 0.7226760 0.8293966 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.5146884 0.5927702 0.6436740 0.6565916 0.7035083 0.8290242 0
## gbm 0.5428877 0.6194338 0.6667787 0.6856818 0.7683215 0.8254076 0
## svm 0.4816081 0.5890361 0.7014865 0.6809040 0.7856210 0.8186759 0
# Predict on test set for the best model
best_model <- models[[which.min(sapply(models, function(x) min(x$results$RMSE)))]]
test_predictions <- predict(best_model, test_preprocessed)
rmse_test <- sqrt(mean((test_predictions - test_preprocessed$Yield)^2))
cat("Test RMSE for the best model:", rmse_test, "\n")
## Test RMSE for the best model: 0.4386147
library(AppliedPredictiveModeling)
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
data("ChemicalManufacturingProcess")
# Preprocessing steps with new object names
dataPreProcessor <- preProcess(ChemicalManufacturingProcess,
method = c("BoxCox", "knnImpute", "center", "scale"))
processedData <- predict(dataPreProcessor, ChemicalManufacturingProcess)
processedData$Yield <- ChemicalManufacturingProcess$Yield
# Splitting into training and test sets with new object names
set.seed(123) # For reproducibility
splitIndex <- sample(seq_len(nrow(processedData)), size = floor(0.85 * nrow(processedData)))
trainingSet <- processedData[splitIndex, ]
testingSet <- processedData[-splitIndex, ]
list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
# Variable importance for the best model
importance <- varImp(best_model, scale = TRUE)
print(importance)
## gbm variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.000
## ManufacturingProcess13 28.353
## BiologicalMaterial03 26.084
## ManufacturingProcess09 19.061
## BiologicalMaterial12 15.785
## BiologicalMaterial09 13.687
## ManufacturingProcess17 10.344
## ManufacturingProcess06 9.620
## ManufacturingProcess30 9.372
## BiologicalMaterial05 8.398
## ManufacturingProcess31 8.309
## ManufacturingProcess01 8.134
## BiologicalMaterial11 7.311
## ManufacturingProcess20 6.857
## ManufacturingProcess04 6.287
## ManufacturingProcess14 5.942
## ManufacturingProcess05 5.408
## ManufacturingProcess26 5.168
## ManufacturingProcess37 4.835
## ManufacturingProcess45 4.785
# Plot variable importance
plot(importance, top = 10)
# Train linear regression model
linear_model <- train(
Yield ~ .,
data = train_preprocessed,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
metric = "RMSE"
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
# Extract variable importance for the linear model
linear_importance <- varImp(linear_model, scale = TRUE)
print(linear_importance)
## lm variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess33 61.85
## ManufacturingProcess04 61.42
## BiologicalMaterial05 46.38
## ManufacturingProcess28 39.39
## BiologicalMaterial02 36.94
## ManufacturingProcess09 36.50
## ManufacturingProcess37 33.80
## ManufacturingProcess45 33.07
## BiologicalMaterial11 32.87
## ManufacturingProcess29 30.12
## ManufacturingProcess12 29.77
## BiologicalMaterial07 29.45
## ManufacturingProcess23 25.02
## ManufacturingProcess38 24.99
## ManufacturingProcess35 21.34
## ManufacturingProcess36 21.27
## ManufacturingProcess07 20.81
## ManufacturingProcess03 19.21
## ManufacturingProcess43 18.67
# Plot variable importance for the linear model
plot(linear_importance, top = 10)
# Get top 10 predictors for nonlinear model
top_predictors_nonlinear <- importance$importance %>%
arrange(desc(Overall)) %>%
head(10)
# Get top 10 predictors for linear model
top_predictors_linear <- linear_importance$importance %>%
arrange(desc(Overall)) %>%
head(10)
# Compare top predictors
print("Top predictors from nonlinear model:")
## [1] "Top predictors from nonlinear model:"
print(top_predictors_nonlinear)
## Overall
## ManufacturingProcess32 100.000000
## ManufacturingProcess13 28.353130
## BiologicalMaterial03 26.083579
## ManufacturingProcess09 19.061239
## BiologicalMaterial12 15.784535
## BiologicalMaterial09 13.687353
## ManufacturingProcess17 10.343768
## ManufacturingProcess06 9.619833
## ManufacturingProcess30 9.371528
## BiologicalMaterial05 8.398010
print("Top predictors from linear model:")
## [1] "Top predictors from linear model:"
print(top_predictors_linear)
## Overall
## ManufacturingProcess32 100.00000
## ManufacturingProcess33 61.84879
## ManufacturingProcess04 61.42045
## BiologicalMaterial05 46.37958
## ManufacturingProcess28 39.38728
## BiologicalMaterial02 36.93659
## ManufacturingProcess09 36.50330
## ManufacturingProcess37 33.79848
## ManufacturingProcess45 33.07245
## BiologicalMaterial11 32.86803
# Analyze overlap
common_predictors <- intersect(
rownames(top_predictors_nonlinear),
rownames(top_predictors_linear)
)
cat("Common predictors between nonlinear and linear models:", common_predictors, "\n")
## Common predictors between nonlinear and linear models: ManufacturingProcess32 ManufacturingProcess09 BiologicalMaterial05
The most important variable is ManufacturingProcess32, with a scaled importance of 100. Process variables dominate the top positions (7 out of 10), suggesting that they contribute more to the optimal nonlinear model. Similar to the nonlinear model, ManufacturingProcess32 appears as the most important predictor, followed by other process variables like ManufacturingProcess13 and ManufacturingProcess09.Manufacturing Process variables are more dominant in both models, particularly in the nonlinear regression model, where they constitute the majority of the top predictors.
# Identify predictors unique to the nonlinear model
unique_predictors_nonlinear <- setdiff(
rownames(top_predictors_nonlinear),
rownames(top_predictors_linear)
)
print("Unique predictors to the nonlinear model:")
## [1] "Unique predictors to the nonlinear model:"
print(unique_predictors_nonlinear)
## [1] "ManufacturingProcess13" "BiologicalMaterial03" "BiologicalMaterial12"
## [4] "BiologicalMaterial09" "ManufacturingProcess17" "ManufacturingProcess06"
## [7] "ManufacturingProcess30"
# Scatter plots for unique nonlinear predictors
library(ggplot2)
for (predictor in unique_predictors_nonlinear) {
plot <- ggplot(chemical_imputed, aes_string(x = predictor, y = "Yield")) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", color = "blue") +
labs(title = paste("Relationship between", predictor, "and Yield"),
x = predictor, y = "Yield") +
theme_minimal()
print(plot)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Both biological and process variables show strong nonlinear effects on Yield.Process variables exhibit more complex relationships, such as U-shapes and thresholds, suggesting the nonlinear model captures nuanced operational interactions Biological variables contribute significantly but tend to have smoother, more predictable relationships with Yield. This analysis highlights the strength of the nonlinear model in capturing these intricate patterns, which could guide both operational adjustments and material selection to maximize production efficiency.